PRIMERS FOR IDENTIFYING, TYPING OR
CLASSIFYING NUCLEIC ACIDS

DNA-sequence analysis is rapidly becoming a standard tool
in modern molecular biology research. Examples of applications include:
sequencing of unknown DNA sequences, identifying novel genes in
stretches of sequenced DNA, predicting protein sequence and structure
from DNA sequence alone, and identification of known gene variations
(sometimes called "typing a gene").

Typing of a gene can be crucial in some applications. For
instance, organ donation requires that the "immunological signature" of the
donor matches that of the receiver. This "signature" is mediated by the
Human Leucocyte Antigen (HLA) complexes (also known as the Major
Histocompatibility Complex, MHC) on the cell surface, and the
corresponding genes are among the most varied in the human genome.
Considering the importance of organ donation, the shortage of organ
donors and the fact that an organ cannot be stored for any long period
of time, rapid and accurate typing of the HLA genes is required in order
to make the most use of the organs available for transplantation.

Another application where rapid and accurate identification
of a gene is desired is the identification of unknown bacteria. Rapid
identification of the bacteria causing the illness of a patient makes it
possible to administer the correct medication early in the treatment of the
disease, thus reducing the discomfort for the patient. Since every self-
replicating organism so far studied uses ribosomes when translating mRNA
to proteins, analysis of one of the genes coding for the ribosome, for
instance the 16S rRNA in the case of prokaryotes, could be used to identify
the organism in question.

There are several ways in which a gene can be identified,
the conceptually easiest being to sequence the entire gene and then
look at the result. The main drawback is that this approach is time-
consuming and not easily scaled up using conventional methodology. A
new method, Arrayed Primer Extension (APEX), lacks this drawback.
APEX works by immobilising a large number of primers on a solid surface,
thus creating a DNA chip. These primers are constructed to be
consecutively overlapping over the entire gene of interest, so that every
base in the gene will have a primer to its 5'-end. By adding fluorescently
labelled dideoxynucleotides, the primers will then be extended by one
nucleotide using the sample DNA as template. It is thus easy to check
which nucleotide was incorporated, which in turn reveals the entire
sequence of the sample DNA.

Since some genes, like the HLA and 16S rRNA genes, have a large
number of known variations, a prohibitively large number of primers has to
be created in order to probe for all possible combinations of variant
positions in the gene. Thus the APEX method for
resequencing would need more than 16,000 primers if all DQB alleles
were to be sequenced from a 500 bp long PCR fragment. If all DQB alleles
had to be combined in pairs, which is the situation for the heterozygotes
found in most individuals, the number of primers might be even higher.

But this might not be necessary, if some variations always or 
never occur together. This needs to be studied though, and a way found to 
determine the least number of primers (and what their sequences are) 
required for unambiguously identifying those genes. 

An object of this invention is to find and implement an efficient
algorithm capable of doing just that. The algorithm should preferably also
take into account the melting points of the primers, so that the extension
reaction can take place under optimal conditions for all of the primers on
the chip. It should also minimise the number of "self-extended" primers, i.e.
primers that can extend themselves without any sample DNA. This
algorithm is then to be tested and evaluated on the HLA and 16S rRNA
genes. HLA is chosen partly because of the importance of rapid typing of
these genes, which means that there are many other methods with
which APEX can be compared. It is also because the HLA genes are
"easy" to work with, since they rarely contain any insertions or deletions.
These kinds of variations in the gene could potentially create problems
when designing primers for APEX. The 16S rRNA, on the other hand,
contains insertions and deletions and can thus be used to see if the
algorithm can handle such variations.

The invention provides a method of identifying a set of
extendible primers for use in the identification, typing or classification of a
nucleic acid of known sequence having known polymorphisms wherein:
i) all possible nucleotide sequences of a chosen length of the
nucleic acid are identified and their corresponding extendible primers,
ii) at least one extendible primer is removed from the set
wherein the at least one primer removed identifies a segment of the nucleic
acid identified by at least one other primer.

Preferably the method includes between step i) and ii):
ia) potential extensions for each primer are identified with
respect to each nucleotide sequence,
ib) for each extendible primer the identified potential extensions
are compared to determine which pairs of sequences can be discriminated
by the primer.

Preferably a matrix of primers and pairs of primer extensions
is prepared in binary form and is subjected to analysis by a set covering
problem (SCP) algorithm as described in more detail below.

The invention also includes a set of extendible primers, for
use in the identification, typing or classification of a nucleic acid of known
sequence having known polymorphisms, identified by the method as
defined. Preferably the primers are attached by their 5'-ends to a surface of a
support on which they are presented in the form of an array.

In another aspect, the invention provides a set of extendible
primers, for use in the identification, typing or classification of a human
leucocyte antigen (HLA) gene as indicated, the set comprising about the
number of primers indicated and being capable of distinguishing about the
number of alleles indicated:

           HLA gene    Number of Alleles    Number of Primers

Class I    HLA-A              91                   172
           HLA-B             200                 <1000
           HLA-C              47                    94
Class II   DPA-1              11                    26
           DPB-1              74                   130
           DQA-1              17                   130
           DQB-1              34                    84
           DRB-1             192                 <1000
           DRB345             35                    94

In another aspect, the invention provides a set of extendible
primers, for use in the identification, typing or classification of 16S rRNA,
wherein the set comprises about 210 primers and is capable of
distinguishing at least about 1207 different sequences.

In these aspects of the invention, the approximate number of
primers is indicated. As indicated below, it may be possible by the use of
the algorithms exemplified or other algorithms to generate slightly smaller
sets of primers capable of distinguishing the number of alleles or
sequences indicated, and these sets are envisaged according to the
invention. Of course, other primers may be present in addition to those
indicated as essential, and may be useful for checking purposes. The
number of alleles or sequences indicated represents the approximate
known number of polymorphisms or different sequences, and these will
surely increase with time.

In another aspect the invention provides a method of
identification, typing or classification of a nucleic acid of known sequence
having known polymorphisms, by the use of the set of extendible primers
as defined, which method comprises applying the nucleic acid or fragments
thereof to the set of extendible primers under hybridisation conditions and
effecting template-directed chain extension of extendible primers that have
formed hybrids. Preferably template-directed chain extension is effected
using four different fluorescently labelled chain-terminating nucleotide
analogues, and results are analysed by an imaging system such as total
internal reflection fluorescence (TIRF) or scanning confocal microscopy.
The various steps of the method may be performed as described in the
literature for the known APEX technique.

In another aspect the invention provides a kit for use in the
identification, typing or characterisation of a nucleic acid of known
sequence having known polymorphisms, comprising the set of extendible
primers as defined.

In another aspect the invention provides an array of sets of 
extendible primers as defined, for the simultaneous identification, typing or 
classification of two or more different HLA genes. 

With the present invention it has been realised that where a
number of different alleles are to be identified, the total number of primers
required to distinguish each of the alleles could be reduced as some
primers would be common to all of the alleles, for example. Thus, with the
present invention complete sets of primers for identification of each allele
are identified and then the total number of primers in the combined sets is
reduced using predetermined rules.

Furthermore the present invention is based on the premise
that, as the primers are used to identify the presence or absence of a
particular nucleotide sequence in any allele, the specific nucleotide that
extends any particular primer is of less relevance than simply whether the
primer has been extended. Thus, the problem of reducing the overall
number of primers is greatly simplified, rendering the problem one suitable
for treatment as a Set Covering Problem (SCP).

Embodiments of the present invention will now be described
by way of example with reference to the accompanying drawings and
examples, in which:

Figure 1 is a diagram of a signal matrix in accordance with
the present invention;

Figure 2 is a diagram of the corresponding binary matrix for
the signal matrix of Figure 1;

Figure 3 is a flow diagram of the steps for reducing the primer
set in accordance with the present invention.

The following is an explanation to assist in an understanding
of the principles underlying the manner in which the number of primers
used in the identification of a plurality of sequences may be reduced.

Theoretically the number of primers required to identify k
sequences grows as O(kl), where l is the length of the sequences, as each
sequence requires l primers. However, the less the sequences differ from
one another, the fewer primers are required, as many of the primers
required for identification of a first sequence may also be of use in
identification of another sequence. This effect becomes more pronounced
the greater the number of sequences to be identified and the greater the
similarities.

Considering an initial set of n primers required in the
identification of k sequences, a signal matrix of k x n can be constructed.
Each element in the matrix represents the signal, if any, that is generated
by a particular primer with respect to a particular sequence. The signal will
either be one of the four nucleotides 'A', 'C', 'G' or 'T', or no signal.
Figure 1 is an example of such a signal matrix where, for example, the
signal generated by primer 2 with respect to sequence 3 is 'T'.

The signal matrix is then converted into a binary matrix that
represents whether the signals for any particular primer differ with respect
to different sequences. Thus, again with respect to primer 2, the same
signal 'G' is generated for both sequences 1 and 2 but a different signal 'T'
is generated with respect to sequence 3. The binary matrix is constructed
by considering each column (each primer) of the signal matrix and
comparing each signal in that column in turn. Thus, as shown in Figure 2,
the first row of the matrix represents a comparison of the signals for the first
and second sequences, the second row represents a comparison of the
signals for the first and third sequences and the third row represents a
comparison of the signals for the second and third sequences. Binary '0'
represents a comparison revealing the same signal and binary '1'
represents a comparison revealing different signals. In the case of primer
2, as mentioned earlier, the signals for the first and second sequences are
the same ('0') whereas the signals for the first and third sequences are
different ('1'). This conversion produces a matrix m x n where m = k(k-1)/2.
Hence, for large numbers of sequences, 2m grows approximately as the
square of the number of sequences. Figure 2 shows the binary matrix for
the signal matrix of Figure 1.

As the primers are required to enable the differentiation of
sequences from one another, the reduction of the signal matrix to a binary
matrix, representing differences in the signals obtained for different
sequences, distils that element of information necessary to enable a
selection of the minimum number of primers necessary to identify the
individual sequences. From the binary matrix the least number of columns
is selected such that each row contains at least one non-zero element.
Thus, if one of the columns contained all '1's, only that one column would
be required. However, in the case of Figure 2, there is no single column
containing all '1's and so two columns must be selected, for example
primers 1 and 2. Primers 1 and 2 together enable each of sequences 1, 2
and 3 to be differentiated and so the remaining primers are redundant.
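
The conversion from signal matrix to binary matrix can be sketched as follows. This is a minimal illustration only: the function name `binary_matrix` and the 3 x 4 example matrix are hypothetical (the example is not a copy of Figure 1, although it was chosen to agree with the behaviour described for primer 2), and `None` is assumed to stand for "no signal".

```python
from itertools import combinations

def binary_matrix(signal_matrix):
    """Build the pairwise-difference matrix.

    Rows correspond to sequence pairs in the order (1,2), (1,3), (2,3), ...
    and columns to primers. An entry is 1 when the primer gives different
    signals for the two sequences of the pair, otherwise 0.
    """
    n_primers = len(signal_matrix[0])
    rows = []
    for a, b in combinations(range(len(signal_matrix)), 2):
        rows.append([int(signal_matrix[a][p] != signal_matrix[b][p])
                     for p in range(n_primers)])
    return rows

# Hypothetical signal matrix: 3 sequences (rows) x 4 primers (columns).
signals = [
    ["A", "G", "C", None],
    ["C", "G", "C", None],
    ["C", "T", "C", "A"],
]
binary = binary_matrix(signals)
for row in binary:
    print(row)
```

In this assumed example no single column consists only of ones, while columns 1 and 2 together leave no all-zero row, which mirrors the situation described for Figure 2.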

Where large numbers of sequences and primers are involved,
the binary matrix renders the data contained within that matrix suitable for
mathematical analysis. Once the selection of the reduced number of
primers has been made, though, it is the signal matrix that is required
during the use of the primers in the identification of the different sequences.
Thus, the signal matrix is used to 'decode' the results of any analysis using
the reduced number of primers.
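
As a hedged illustration of this decoding step, the sketch below looks up which known sequences are consistent with the signals observed on a reduced primer set. The function `decode`, the dictionary format of `observed` and the example matrix are assumptions made for the illustration, with `None` again standing for "no signal".

```python
def decode(signal_matrix, primer_indices, observed):
    """Return the indices of all sequences whose expected signals on the
    selected primers equal the observed signals.

    observed maps a primer index to the signal read from the chip.
    """
    return [s for s, row in enumerate(signal_matrix)
            if all(row[p] == observed[p] for p in primer_indices)]

# Hypothetical signal matrix: 3 sequences (rows) x 4 primers (columns).
signals = [
    ["A", "G", "C", None],
    ["C", "G", "C", None],
    ["C", "T", "C", "A"],
]

# Only primers 0 and 1 are kept on the chip in this example.
matches = decode(signals, [0, 1], {0: "C", 1: "G"})
print(matches)
```

A result with exactly one index means that the sample is identified unambiguously; an empty list would indicate a signal pattern that matches none of the known sequences.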

In practice, large numbers of sequences and primers are
involved and the selection of a reduced set of primers cannot be performed
by simple inspection of the binary matrix. For large numbers of primers,
selection of a suitable reduced set of primers can be performed by treating
the selection as a Set Covering Problem (SCP). An SCP is an integer
optimisation problem and is well known in fields such as airline crew
scheduling, selecting manufacturing equipment and ingot mould selection
in steel production. In such large scale problems that cannot be solved
exactly (NP-hard), heuristics are used in order to generate a solution. As an
SCP is NP-hard, global algorithms and algorithms that identify local optima
are not very suitable on their own for a large scale SCP. They will simply
require far too much computation, as they try to find a solution that can be
proven to be at least locally optimal. For this reason heuristic methods are
required instead. They do not claim to give even a locally optimal solution,
but are much faster.

Two known computational methods that have been found to
be effective in identifying reduced sets of primers are the "greedy" algorithm
and the Lagrangian relaxation algorithm.

Greedy Algorithm 

The simplest heuristic is the greedy algorithm, where
columns are added one at a time. The column to be added in each step is
chosen so as to cover as many uncovered rows as possible (a row is
covered if it has at least one non-zero element). In other words, if S_r is the
set of columns already included in the solution at iteration r, and R_r is the
set of rows with no non-zero elements at iteration r, column j_r is selected
according to:

$$ j_r = \arg\min_{j \notin S_r} \frac{c_j}{P_j} $$

Equation 1

where c_j is the cost of column j and P_j is the number of rows in R_r
covered by column j. This continues until all rows are covered, or until no more
columns exist which can cover any of the rows still uncovered. Instead of
minimising the term c_j/P_j, other terms can be used. Example terms are c_j,
c_j/log2 P_j or c_j/(P_j)^2. Greedy algorithms of this type are described in "An
Efficient Heuristic for Large Set Covering Problems", Vasko, Wilson, Naval
Research Logistics Quarterly 1984, 31:163-171, the contents of which are
incorporated herein by reference. The difference is in how much emphasis
to place on the cost of the column versus how many rows the column
covers. It is shown, however, that this entire class of heuristics shares the
same worst case behaviour. If we denote the set of columns in the solution
as S and the solution value as Z, then the worst case behaviour can be
described as:

$$ \frac{Z}{Z_{opt}} \le \sum_{j=1}^{d} \frac{1}{j} $$

Equation 2

where Z_opt is the value of the optimal solution and

$$ d = \max_{j} \sum_{i=1}^{m} a_{ij} $$

Equation 3

In other words, how much worse the heuristic solution is
compared to the optimal solution depends on the maximum number of
non-zero elements in the columns. The advantage is that this algorithm is
fast, even though its time complexity is O(m^2 n) (there can be a maximum of
m columns in the solution, i.e. the maximum number of iterations is m, and for
each iteration the matrix is traversed once to find the next column to be
added). Altogether, the time required to solve the problem in
the worst case scenario will grow as the number of sequences to the power
of five (four due to the number of rows, and one due to the number of
columns). In the case of 16S rRNA (see later), where we have ~1000
sequences, the matrix will have ~500,000 rows. The number of primers
(columns) is in this case ~250,000.
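
A minimal sketch of the greedy column selection described above is given here, assuming each row needs to be covered only once. The function name `greedy_cover`, the default unit costs and the small example matrix are assumptions for illustration; ties are broken in favour of the lowest column index, which the text does not prescribe.

```python
def greedy_cover(matrix, costs=None):
    """Greedy set covering on a 0/1 matrix (rows = sequence pairs,
    columns = primers). In each step the column minimising
    cost / (number of still-uncovered rows it covers) is added.

    Returns the chosen column indices and the set of rows that could not
    be covered by any column.
    """
    m, n = len(matrix), len(matrix[0])
    costs = costs or [1.0] * n
    uncovered = set(range(m))
    chosen = []
    while uncovered:
        best, best_ratio = None, None
        for j in range(n):
            if j in chosen:
                continue
            p = sum(1 for i in uncovered if matrix[i][j])
            if p == 0:
                continue  # column covers no uncovered row
            ratio = costs[j] / p
            if best_ratio is None or ratio < best_ratio:
                best, best_ratio = j, ratio
        if best is None:
            break  # remaining rows cannot be covered by any column
        chosen.append(best)
        uncovered -= {i for i in uncovered if matrix[i][best]}
    return chosen, uncovered

# Hypothetical binary matrix: 3 sequence pairs x 4 primers.
binary = [
    [1, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 0, 1],
]
chosen, uncovered = greedy_cover(binary)
print(chosen, uncovered)
```

Each pass of the outer loop scans the whole matrix once, which is where the O(m^2 n) worst-case estimate in the text comes from.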

Lagrangian relaxation

More sophisticated methods exist, which use other kinds of
heuristics. The heuristic believed to generate the best solutions
is some kind of Lagrangian relaxation heuristic, where in
each iteration the Lagrange multipliers are used to
calculate the Lagrangian cost for each column. Such a Lagrangian
relaxation heuristic is described in "A Heuristic Method for the Set Covering
Problem", Caprara et al., Technical Report OR-95-8, Operations Research
Group, University of Bologna 1995, the content of which is incorporated
herein by reference. A near-optimal vector of these costs is then calculated
by a subgradient algorithm, before being used as input to a greedy
algorithm. This is repeated until no improvements in the solution can be
made.

In Lagrangian subgradient methods the Lagrangian of the
original problem is considered instead of the original problem. In this case,
the Lagrangian will be

$$ L(u) = \min_{x \in \{0,1\}^n} \sum_{j=1}^{n} c_j(u)\, x_j + \sum_{i=1}^{m} u_i $$

Equation 4

where u_i is the Lagrangian multiplier for row i. c_j(u) is the
Lagrangian cost associated with column j, and is defined by

$$ c_j(u) = c_j - \sum_{i=1}^{m} a_{ij}\, u_i $$

Equation 5

where a_ij is the element in row i and column j of the binary matrix.
An optimal solution to Equation 4 is given by

$$ x_j(u) = \begin{cases} 0 & \text{if } c_j(u) > 0 \\ 1 & \text{if } c_j(u) < 0 \\ 0 \text{ or } 1 & \text{if } c_j(u) = 0 \end{cases} $$

Equation 6

L(u) can also be seen as an estimate of the lower bound for
the solution, i.e. the sum of the costs for the columns in the optimal solution
to the SCP will be >= L(u). The solution to the SCP can be found by finding
an optimal multiplier vector u* instead, but this will require much
computation, especially for a large SCP. But near-optimal multiplier vectors
can be found within a short time by using the subgradient vector s(u), defined
by

$$ s_i(u) = 1 - \sum_{j=1}^{n} a_{ij}\, x_j(u) $$

Equation 7

u can be refined iteratively by using for example

$$ u_i^{k+1} = \max\left\{ u_i^{k} + \lambda\, \frac{UB - L(u^{k})}{\lVert s(u^{k}) \rVert^{2}}\, s_i(u^{k}),\; 0 \right\} $$

Equation 8

where λ > 0 is a step-size parameter and UB is an upper
bound on the value of the solution. The initial u^0 can be defined arbitrarily.
To solve the SCP, first a near-optimal multiplier vector u is found. This and
Equation 6 are then used as a basis to form a feasible solution. The upper
bound UB can then be updated to the value of this feasible solution (if it is
better than the previous best solution), and a new near-optimal multiplier
vector found and so on until convergence is reached.
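
One subgradient update of the kind described above might be sketched as follows. The sketch assumes the standard Lagrangian relaxation of a set covering problem in which every row must be covered once: Lagrangian cost c_j(u) = c_j minus the sum of u_i over the rows covered by column j, and subgradient s_i(u) = 1 minus the number of selected columns covering row i. The function name `lagrangian_step`, its argument names and the example data are hypothetical.

```python
def lagrangian_step(a, c, u, ub, lam):
    """Perform one subgradient update of the multiplier vector u.

    a   : 0/1 matrix (rows x columns), c : column costs,
    u   : current non-negative multipliers (one per row),
    ub  : upper bound on the solution value, lam : step-size parameter.
    Returns (new multipliers, lower bound L(u) for the old multipliers).
    """
    m, n = len(a), len(c)
    # Lagrangian cost of each column.
    red = [c[j] - sum(a[i][j] * u[i] for i in range(m)) for j in range(n)]
    # Relaxed solution: take exactly the columns with negative cost.
    x = [1 if rc < 0 else 0 for rc in red]
    lower = sum(rc for rc in red if rc < 0) + sum(u)
    # Subgradient: positive for under-covered rows, negative if over-covered.
    s = [1 - sum(a[i][j] * x[j] for j in range(n)) for i in range(m)]
    norm2 = sum(v * v for v in s)
    if norm2 == 0:
        return list(u), lower  # nothing to adjust
    step = lam * (ub - lower) / norm2
    new_u = [max(u[i] + step * s[i], 0.0) for i in range(m)]
    return new_u, lower

# Hypothetical instance: 3 rows, 4 unit-cost columns, known cover of size 2.
a = [[1, 0, 0, 0], [1, 1, 0, 1], [0, 1, 0, 1]]
c = [1.0, 1.0, 1.0, 1.0]
u = [0.0, 0.0, 0.0]
bounds = []
for _ in range(50):
    u, lower = lagrangian_step(a, c, u, ub=2.0, lam=0.1)
    bounds.append(lower)
print(max(bounds))
```

Because L(u) is a lower bound for any non-negative u, the largest value seen during the iterations can be kept as the estimated lower bound while the feasible solutions supply the upper bound.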



WO 00/65088 PCT/EP00/03636



Another alternative computational method that may be
employed to solve such an SCP is 'surrogate relaxation', in which in each
iteration a corresponding continuous problem is solved and made feasible
before a subgradient algorithm is applied. Alternatively, genetic algorithms
may be employed in which the 'genome' consists of n bits, one bit for each
of the columns.

It should also be borne in mind that, as the SCP operates on
the binary matrix which only represents differences in signals between
sequences for the same primer, a primer in the selected reduced set may
generate a negative signal (no signal) rather than a positive signal (A, C, G
or T). To be sure that the sample does in fact contain a particular sequence it is
essential to ensure that for each sequence at least one primer generates a
positive signal. Furthermore, in practice redundancy is desirable as all
reactions may not occur as intended. Therefore, the least number of
positive signals as well as the least number of differences in the signal
pattern is preferably larger than one.

With reference to Figure 3, the following is a description of 
one method of selecting a reduced set of primers. 

Firstly, all possible primers are selected (10) using the
standard APEX procedure to produce a first set of primers. During this
selection a substring of the sequence to be analysed is used to construct
one primer, then the substring is displaced by one base and another primer
is constructed. This process is carried out from the start of the sequence
until the entire sequence has been covered. Both strands of DNA are used
and this is repeated for all sequences. The primers should be long enough
to be capable of discriminating between exact matches and mismatches
involving one or two nucleotide pairs. Conveniently, the primers are 13 bp
long, as this has been found to be sufficient to ensure the reaction, or longer
to increase hybrid stability. However, to avoid steric hindrance on the chip
each primer may be 5'-tailed. In this example, twelve 'T's are added to the
5'-end of the primer so that the final length of the primers is 25 bp.
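
The sliding-window construction of candidate primers could look roughly like the sketch below. It assumes that a primer is written as a substring of one strand (so that it hybridises to the complementary strand), that the second strand is obtained as the reverse complement, and that sequences contain only A, C, G and T. The names `candidate_primers` and `reverse_complement` and the example sequence are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def candidate_primers(sequence, length=13, tail="T" * 12):
    """Slide a window of the given length over both strands, one base at a
    time, and prepend the 5' tail to every window."""
    primers = set()
    for strand in (sequence, reverse_complement(sequence)):
        for start in range(len(strand) - length + 1):
            primers.add(tail + strand[start:start + length])
    return sorted(primers)

# Hypothetical 20-base sequence.
example = "ACGTTGCAAGCTTAGGCTAA"
primers = candidate_primers(example)
print(len(primers), primers[0])
```

With the default values each primer is 25 bases long (12 tail bases plus 13 specific bases); duplicates arising from several sequences or from both strands collapse because a set is used.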

Next all primers that are not suitable as primers are rejected
(12) and the rest are included in a primary primer set. Unsuitable primers
are those where the three bases at the 3'-end are complementary to any
substring of the primer. In some instances this can result in the primer
being extended using a neighbouring primer and not the sample DNA as a
template, and for that reason such primers are considered unsuitable.

Also, any primers that would produce ambiguous signals are 
identified and rejected (14). A primer produces an ambiguous signal where 
it is not known which of the four bases is in the relevant position. 

Each of the remaining primers in the primary primer set is
then compared to each sequence in turn to determine whether the primer is
extendible by each sequence and, if the primer is extendible, the base with
which it would be extended is determined. A signal matrix of the primers
with respect to each of the sequences is thus generated (16).

In order for a primer to be extended using the sample DNA as
template, the three bases in the 3'-end of the primer must hybridise to the
DNA. Otherwise the enzyme responsible for the extension will not be able
to add a nucleotide to the primer. Of the rest of the primer (the poly-T tail
excluded), at most two mismatches are allowed, otherwise the primer-DNA
duplex is considered to be too unstable to be extended.

In ordinary PCR, all the bases must match in order for the
primer to be extended. But then the temperature is raised to the melting
point, Tm, of the primer in the extension step. In APEX, this reaction is
carried out at 45 °C, which is around 10-20 °C below the Tm of most primers.
This means that the primers will hybridise to the DNA despite a few
mismatches, which is why two mismatches are allowed here.
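
The extendibility rule can be illustrated with the sketch below. It assumes that the primer (without its poly-T tail) and the sequence are written in the same sense, so that hybridisation to the complementary template strand corresponds to a string match, and that the extension signal is the base immediately following the matched site. The function name `extension_signals` and the test strings are hypothetical.

```python
def extension_signals(primer, strand, anchor=3, max_mismatch=2):
    """Return the set of bases by which the primer could be extended.

    A site qualifies when the last `anchor` bases match exactly and the
    remaining bases contain at most `max_mismatch` mismatches.
    """
    k = len(primer)
    signals = set()
    # A base must follow the site, hence the range stops at len - k.
    for start in range(len(strand) - k):
        site = strand[start:start + k]
        if site[-anchor:] != primer[-anchor:]:
            continue  # the 3'-end must hybridise perfectly
        mismatches = sum(1 for p, s in zip(primer[:-anchor], site[:-anchor])
                         if p != s)
        if mismatches <= max_mismatch:
            signals.add(strand[start + k])
    return signals

strand = "ACGTACGTTAGCCA"
print(extension_signals("ACGTACG", strand))   # perfect match
print(extension_signals("TCGAACG", strand))   # two mismatches, still extended
print(extension_signals("TGGAACG", strand))   # three mismatches, rejected
```

An empty set corresponds to "no signal" in the signal matrix, and a set with more than one base corresponds to a primer that hybridises at several positions with different outcomes.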

In some cases a primer could hybridise to a sequence in
more than one position, and sometimes a primer could hybridise to both
strands of one allele and give different signals. In those cases all the
different signals are combined to form one resulting signal (e.g. 'A' and 'C'
together form 'M', which is the NC-IUB (NC-IUB, 1985) code for this
combination).
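
Combining several signals into a single ambiguity code can be sketched with a lookup table of the NC-IUB codes. The function name `combine_signals` and the use of `None` for "no signal" are assumptions made for this illustration.

```python
# NC-IUB ambiguity codes for sets of nucleotides.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}

def combine_signals(signals):
    """Map a collection of extension bases to one resulting signal,
    or None when the primer is not extended at all."""
    signals = frozenset(signals)
    return IUPAC[signals] if signals else None

print(combine_signals({"A", "C"}))
```

The combined code can then be stored directly as the entry of the signal matrix for that primer and sequence.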

For each column of the signal matrix the entries for each row
are compared against one another; in other words, for each primer the
signals produced by the primer for each sequence are compared against
each other. A binary matrix is thus generated (18) of the primers with
respect to the identity or difference of signals for pairs of sequences. The
binary matrix contains non-zero entries where the primer is able to
distinguish between a pair of sequences.

The number of pairs of sequences that each primer can
distinguish between is counted and a score is allocated to each primer
(20) in dependence on the total number of pairs of sequences counted.
Thus, the number of non-zero elements for each primer is counted.
Primers that are unable to distinguish between any pairs of sequences are
rejected (22) and the remaining primers are sorted (24) in order of their
score with the primers with the higher scores at the beginning.
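
The scoring and sorting steps (20) to (24) might be sketched as below. Sorting ties by the number of positive signals is an assumption borrowed from the tie-break the text describes for choosing the first core primer; the function name `rank_primers` and the example matrices are hypothetical, and `None` denotes "no signal".

```python
def rank_primers(binary, signal_matrix):
    """Score each primer by the number of sequence pairs it distinguishes,
    drop primers with score 0 and return the remaining primer indices,
    best first."""
    ranked = []
    for j in range(len(binary[0])):
        score = sum(row[j] for row in binary)
        if score == 0:
            continue  # cannot distinguish any pair: rejected
        positives = sum(1 for row in signal_matrix if row[j] is not None)
        ranked.append((score, positives, j))
    # Higher score first, then more positive signals, then lower index.
    ranked.sort(key=lambda t: (-t[0], -t[1], t[2]))
    return [j for _, _, j in ranked]

signals = [
    ["A", "G", "C", None],
    ["C", "G", "C", None],
    ["C", "T", "C", "A"],
]
binary = [
    [1, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 0, 1],
]
order = rank_primers(binary, signals)
print(order)
```

In this assumed example the third primer gives the same signal for every sequence and is therefore dropped.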

A core of primers is created next (26). The primer with the
highest score is selected. Where two primers with equal scores exist, the
number of positive signals is determined for each and the primer with the
greater number of positive signals is chosen. If both primers remain equal,
one is then selected arbitrarily over the other. After the main primer has
been selected, the first twenty (five times the desired redundancy, which is
four here) primers giving positive signals for each sequence in turn are
selected for the core. All remaining primers are rejected.

A greedy algorithm is then run (28) using the core set of
primers to identify the minimum number of primers necessary to distinguish
each sequence. As the greedy algorithm is run, primers are added one at
a time with each primer being selected in turn in relation to the number of
uncovered rows it is capable of covering. When all rows are covered at
least four times the reduced set of primers is checked for any sequences
that have fewer than four positive signals and extra primers are added as
necessary to meet this minimum requirement.

A redundancy check is then performed (30) to identify
whether any more primers can be removed. During the redundancy check
each primer is "tentatively" removed in turn to see whether the remaining
primers meet the minimum requirements.

If not, the next primer is tried. Otherwise the primer is
temporarily removed from the set, and the process continues with the next
primer in line. This process continues until no more primers can be
removed, in which case the last primer to be removed is added back to the
set, and the next primer in line tentatively removed and so on. This can be
viewed as a depth-first search of a tree where the nodes are combinations
of primers, and the number of primers in each node is one less than in a
node one level above. The root node thus contains all primers from the
greedy algorithm. It has p (the number of primers after the greedy
algorithm) primers in it. It also has p child nodes (because there are p
ways in which you can remove one primer from a set of p primers), each
with p-1 primers. Each of them has p-1 children with p-2 primers and so
on. In this way, all possible combinations of primers in the set fulfilling the
requirements are found, and those combinations with the same, least
number of primers are saved as the final primer sets.
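
The sketch below shows a deliberately simplified redundancy check: a single pass in which each primer is tentatively removed and stays removed if every row is still covered often enough. It follows only one branch of the search tree described above rather than the full depth-first search, and it checks row coverage only (not the number of positive signals per sequence). The name `prune` and the parameter `min_cover` are hypothetical.

```python
def prune(matrix, chosen, min_cover=1):
    """Single-pass redundancy check on a 0/1 matrix.

    Each chosen column is tentatively removed in turn; the removal is kept
    when every row is still covered at least min_cover times.
    """
    kept = list(chosen)
    for j in list(kept):
        trial = [c for c in kept if c != j]
        if all(sum(row[c] for c in trial) >= min_cover for row in matrix):
            kept = trial
    return kept

binary = [
    [1, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 0, 1],
]
reduced = prune(binary, [0, 1, 3])
print(reduced)
```

A full implementation in the spirit of the text would backtrack, re-adding the last removed primer and trying the next one, and would collect all smallest sets found.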

Instead of applying the greedy algorithm to the core set, a
modified algorithm called CFT may be applied.

Lagrangian subgradient

This algorithm consists of three main phases: a subgradient
phase where a near-optimal multiplier vector is found, a heuristic phase
where a solution to the SCP is found, and column fixing, designed to
improve the results of the heuristic phase.

In the subgradient phase, a near-optimal multiplier vector u is
found using Equation 8. At the beginning, the starting vector used is
defined as

$$ u_i = \min_{j \in J_i} \frac{c_j}{|I_j|} $$

Equation 9

where J_i is the set of columns covering row i and I_j is the set of rows
covered by column j. Later calls use the last vector u before column fixing, and
apply a small perturbation before using it as the starting vector. The
perturbation is randomly (and uniformly) distributed in the range ±10% for
each element. The sequence of multiplier vectors is considered to have
converged when the improvement in L(u) in the last 50 iterations is smaller
than 0.1%, or when the number of iterations reaches 10 x m. The factor λ
in Equation 8 was set to 0.1 at the beginning, and was updated as follows:
every 20 iterations, the best and worst lower bounds L(u) during those 20
iterations are compared to each other. If the difference is larger than 1%,
the value of λ is halved. If the difference is less than 0.1%, λ is multiplied
by 1.5. In the first call, the upper bound, UB, used is the sum of the costs
of the first primers that together cover all rows four times. Otherwise it is
the value of the best solution found so far.
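
The step-size rule described for the subgradient phase can be written as a small helper. The sketch assumes that "difference" means the gap between the best and worst lower bound relative to the best one, which the text does not state explicitly; the function name `update_step_size` is hypothetical.

```python
def update_step_size(lam, lower_bounds):
    """Adapt the step-size parameter from the lower bounds L(u) seen in the
    last block of iterations (20 in the text)."""
    best, worst = max(lower_bounds), min(lower_bounds)
    if best == 0:
        return lam  # relative gap undefined, leave the step size unchanged
    gap = (best - worst) / abs(best)
    if gap > 0.01:
        return lam / 2      # bounds oscillate too much: smaller steps
    if gap < 0.001:
        return lam * 1.5    # bounds barely move: larger steps
    return lam

print(update_step_size(0.1, [100.0, 95.0]))
```

Called once every 20 iterations, this halves the step size when the lower bound fluctuates by more than 1% and enlarges it when progress has nearly stalled.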

In the heuristic phase, the last vector from the subgradient
phase is used to generate a sequence of multiplier vectors (again using
Equation 8), and a feasible solution is constructed for each of the multiplier
vectors. The procedure used to generate a feasible solution is a variation
of the greedy algorithm, where each column is scored according to

Equation 10

where R is the set of uncovered rows in each step. The
column with the lowest score, i.e. the column with the best "gain/cost" ratio, is
added in each step to the solution. This continues until no improvements to
the best solution (i.e. minimum number of primers) have been made for 50
iterations.

After the heuristic phase column fixing is applied to the
solution. Columns that are absolutely necessary in order for a row to be
covered (i.e. if there are only e columns covering a row and each row is to
be covered e times) are fixed. These fixed columns are then used as a
starting point for the greedy algorithm, and the first max{⌊200/m⌋, 1}
columns chosen therein are fixed as well.

These three phases are then applied again to the problem,
with the condition that the fixed columns must be included in the solution
this time. Columns already fixed in a previous round cannot be removed
from the solution. This goes on until either all rows are covered by the
fixed columns, or the cost of the fixed columns is larger than the estimated
lower bound for the entire problem, or no new columns were fixed in the
last iteration.

When the three phases are done, the problem is refined, in
order to improve the solution. Here, each column in the best solution found
so far is scored according to

$$ \delta_j = \max\{c_j(u),\, 0\} + \sum_{i \in I_j} u_i\, \frac{K_i - 1}{K_i} $$

Equation 11

where

$$ K_i = \sum_{j \in S} a_{ij} $$

Equation 12

is the number of columns in the solution covering row i,
and S is the set of columns in the solution. The term u_i(K_i - 1)
is the contribution of row i to the gap between the estimated lower and
upper bound of the problem. This is then split uniformly between all
columns in the solution covering that row. Columns with small δ_j
(contributing the least to the gap) are then likely to be part of the optimal
solution. The p columns with the smallest δ_j are then fixed before the entire
algorithm is applied again to the resulting sub-problem. (Column fixing
here has nothing to do with column fixing after the heuristic phase, so
columns fixed there need no longer be fixed here.) p is the smallest value
satisfying




ex m 
Equation 13 



where {jk} is the set of columns in the solution ordered with 
ascending and /> is the set of rows covered by column j. n is in the range 

10 0,..1 and controls the percentage number of rows removed after fixing, n = 
1 means that no rows will be uncovered, while n = 0 means that no 
columns will be fixed before reapplying the algorithm. (Since each row has 
to be covered multiple times, in this case it is not actually the number of 
rows but the number of elements covering the rows that are regulated by 

15 7i). In the beginning, n is set to 0.3 and is multiplied with a = 1 .1 if the best 
solution so far was not improved in the last application of the three main 
phases. If a better solution was found, tc was reset to 0.3. Because of the 
density of the matrices, the number of columns fixed in this step was also 
set to be at least one more than In the previous iteration (If no 

20 improvements were made). Othenwise the same number of columns would 
be fixed In a number of iterations before the value of :t is large enough to 
allow more columns to be fixed. 
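
The update schedule for the fixing parameter can be sketched as follows. This is only an illustration of the rules stated in the text (start at 0.3, multiply by 1.1 when no improvement was found, reset on improvement, fix at least one more column than last time when no improvement was found). The cap at 1.0 and all names are assumptions; the number of columns implied by Equation 13 is passed in as p_from_rule rather than computed.

```python
def next_fixing_fraction(current, improved, start=0.3, factor=1.1):
    """Return the fixing fraction to use in the next refinement round."""
    if improved:
        return start                       # a better solution was found: reset
    return min(1.0, current * factor)      # assumed cap, since the range is 0 to 1


def columns_to_fix(p_from_rule, p_previous, improved):
    """Apply the 'at least one more than last time' rule when no improvement was made."""
    if improved:
        return p_from_rule
    return max(p_from_rule, p_previous + 1)
```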

The algorithm is iterated until the value of the best
solution is less than the estimated lower bound, all columns in the best
solution found so far are already fixed in the refining step, or a time limit is
exceeded. The time limit in this case was arbitrarily set to as many
seconds as there were rows in the problem. However, the time limit is only
checked before the refining step. If it is not exceeded, a whole iteration of
the algorithm will be executed before another check is done. Here too a
check was done afterwards to see if primers could be removed without
breaking any constraints.

With this algorithm no pricing is performed. Pricing is used to
update the core problem, exchanging columns between the core problem
and columns outside the core. It was not included here since it was argued
that, since the costs of the columns are all the same, the best columns
would be those with the largest number of non-zero elements. These
would be the first columns to be added to the core, and the columns not
included in the core would most probably not be better than those included.
Also, the pricing step would require some computation, which would extend the
time required by this algorithm. As is, the computational requirement of this
algorithm is several orders of magnitude higher than for the greedy
algorithm. Finally, the main memory available in the computer puts a limit
on how large the problems can be. If pricing were included, all data would
not fit into the physical memory, forcing the computer to use a swap file,
which would increase the computation times considerably.

Using both alternative algorithms described above a minimum 
number of primers were identified for various sequences. The results are 
set out below. 

It will be apparent that the initial manual rejection of primers,
steps (12, 14 and 22), need not be performed and instead the algorithms
can be applied to the original complete set of primers. However, the initial
rejection of obviously failed primer candidates can significantly reduce the
computational time required in the later stages. Similarly, the final
redundancy check (30) need not be performed, as in many cases
little or no reduction in the number of primers was achieved by this final
check.

Furthermore, although in the method described above the
primers were initially sorted in order of score, this need not be performed.
The algorithms for stripping out redundant primers are capable of operating
with any order of primers, including a wholly random order. However,
slightly better results were obtained when ordering by score was
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performed. 

Collecting sequences 

The HLA sequences were available internally from
Amersham Pharmacia Biotech (release December 1997), and included 91
alleles from HLA-A, 200 HLA-B, 47 HLA-C, 11 HLA-DPA1 (coding for the
α-chain), 74 HLA-DPB1 (β-chain), 17 HLA-DQA1, 34 HLA-DQB1, 192
HLA-DRB1 and 35 sequences in all of HLA-DRB3, -DRB4 and -DRB5. The length of
these sequences ranges from ~250 bp to ~1100 bp.

The 16S rRNA sequences were collected from GenBank
(Benson et al., 1998), an annotated database of all publicly available DNA
sequences. Only a subset of all the available 16S rRNA sequences was
used. The sequences used were all from organisms that could be
identified using either the MicroLog or the MicroStation system from Biolog
Inc., or the API systems from Counterpart Diagnostics. These systems
utilise differences in metabolism in order to identify the organisms, which is
the most common way of identifying micro-organisms today. Altogether,
1207 sequences from 523 different organisms were collected from
GenBank. 269 of those 523 organisms had only one 16S rRNA sequence
among those 1207 sequences. The length of these sequences is between
~1000 bp and ~1500 bp.



Data set     No. sequences    Mean length of sequences
DPA1                11                  517
DPB1                74                  288
DQA1                17                  616
DQB1                34                  490
DRB1               192                  324
DRB345              35                  400
HLA-A               91                  944
HLA-B              200                  900
HLA-C               47                 1003
16S rRNA          1207                 1452

Table 1: Details about data sets.
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10 






The program was written using the Microsoft® Visual C++®,
version 5.0 compiler. It was executed on a PC with a Pentium® MMX 233
MHz processor, 64 MB RAM and Windows® 95, unless otherwise
indicated. All execution times are for the entire program, including I/O.

As can be seen in Table 2, the binary SCP matrices were
quite dense. The density (i.e. the proportion of non-zero elements in the
matrix) of an SCP matrix usually lies around a few percent, of course depending on the
application. A higher density means that fewer columns are needed in
order to cover all rows. This is offset in this case by the fact that all rows
were required to be covered multiple times. Another consequence of this
high density is that the number of primers needed according to the greedy
algorithm could be much higher than in the optimal solution. (Recall that
the worst case behaviour of the greedy algorithm is a function of the largest
column-sum of elements.)



Dataset       DPA1   DPB1   DQA1   DQB1   DRB1  DRB345  HLA-A  HLA-B  HLA-C  16S rRNA
No. rows        55   2701    136    561  18336     595   4095  19900   1081    727821
Density (%)  47.89  20.73  36.31  42.18  24.98   37.70  36.31  32.33  30.41      2.04

Table 2: Some details about the binary SCP matrix. Data are
calculated for all primers in the primary set.
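
The row counts in Table 2 are consistent with one row per unordered pair of sequences, i.e. k(k-1)/2 rows for k sequences. The short check below recomputes them from the sequence counts in Table 1; the dictionary and function names are chosen here for illustration.

```python
from math import comb

# Number of sequences per data set, as listed in Table 1.
N_SEQUENCES = {
    "DPA1": 11, "DPB1": 74, "DQA1": 17, "DQB1": 34, "DRB1": 192,
    "DRB345": 35, "HLA-A": 91, "HLA-B": 200, "HLA-C": 47, "16S rRNA": 1207,
}


def n_rows(n_sequences):
    """One row per unordered pair of sequences: k(k-1)/2."""
    return comb(n_sequences, 2)


ROWS = {name: n_rows(k) for name, k in N_SEQUENCES.items()}
```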

The program could be considered as consisting of two
phases. The first phase involves constructing all primers and finding out
what kind of signal they will give for each sequence. The second phase is
the optimisation phase, where the SCP is solved. Some details about the
first phase can be found in Table 3.



Dataset      DPA1  DPB1   DQA1   DQB1   DRB1  DRB345   HLA-A   HLA-B  HLA-C  16S rRNA
First set    1747  1885   2487   2891   3891    3031    4756    4994   4293    247877
Primary set  1333  1475   2166   2730   3651    3016    3886    4585   3354    247877
Core set      106   321    213    244    385     203     595     750    338      2377
Time (s)     4.67  6.81  11.26  16.51  42.29   14.56  124.74  286.82  61.29    150632

Table 3: Number of primers in different stages of the algorithm and time to
get signals for all primers. The number of primers in the core are for
homozygotes.
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One explanation to this high density is that the sequences in 
the data sets are quite similar to each other, so that most primers will
hybridise to and give signal for more than one sequence (either the same
or different signals). This is also indicated in Table 3, where for some data
sets there is a noticeable drop from the number of primers in the first set to
the number of primers in the primary set. Most of this reduction is due to a
primer having the same signal for all sequences, which in turn means that
all sequences have a substring that is similar enough for the primer to
hybridise to and that the nucleotide after the primer is the same for all
sequences. In contrast, the 16S rRNA data set has a much lower density,
and no reduction in the primers going from the first set of primers to the
primary set. As the sequences in this data set come from organisms which
might be only distantly related to each other, there need not be as much
similarity between the sequences as there is in the HLA data sets. Another
explanation is this: If all k sequences except one give the same signal for a
primer, that column in the binary SCP matrix will have k-1 non-zero
elements. The density (for that column) will then be (k-1) / (k(k-1)/2) = 2/k.
In other words, the density will be higher for smaller values of k, and
lower for larger values. This means that it would be "natural" for smaller
matrices to have higher densities, and larger matrices to have lower
densities.
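
The 2/k argument can be checked numerically. The sketch below assumes, consistently with the description above, that an entry of the binary SCP matrix is 1 when a primer gives different signals for the two sequences of a pair and 0 otherwise. Function names are illustrative.

```python
from itertools import combinations


def scp_column(signals):
    """Binary column for one primer: one entry per unordered pair of sequences.

    signals: the signal the primer gives for each sequence, in sequence order.
    """
    return [int(a != b) for a, b in combinations(signals, 2)]


def column_density(signals):
    """Proportion of non-zero entries in the primer's column."""
    column = scp_column(signals)
    return sum(column) / len(column)
```

For example, with k = 10 sequences of which nine give signal 'A' and one gives 'G', the column has 45 entries, 9 of them non-zero, and the density is 9/45 = 0.2 = 2/k.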

In the second phase, solving the SCP, a few different
approaches were tried. The results, the minimum number of primers
needed and the time required to find this number, can be found in Table 4
and Table 5. Even though the worst case behaviour of the greedy
algorithm is not so good in this application, the results are not much worse
than when using a Lagrangian subgradient (CFT) method. The greedy
algorithm typically needs two or three more primers, while the computation
times are much lower for the greedy algorithm.

The results show that it is worthwhile to check the results
from the greedy algorithm for redundancy. In all cases except one, primers
could be removed and the resulting primer sets still fulfil all requirements.
This is not true for the CFT algorithm, however, as there is only one
instance in which the result could be improved. On the other hand, since
there is some randomness in the CFT algorithm (an old multiplier vector is
disturbed randomly before being used as a starting vector in the next
iteration), the results can differ from one execution of the algorithm to
another. Sometimes the results can be improved, and sometimes not
(results not shown).

Dataset    DPA1  DPB1  DQA1  DQB1  DRB1  DRB345  HLA-A  HLA-B  HLA-C  16S rRNA
Greedy       11    42    32    31    48      24     73    103     51       210
Time (s)   0.27  1.37  0.61  0.71  11.5    0.66   4.61  31.36   1.15  9921.46*
Final        11    41    30    29    44      21     72     99     47      197†
Total (s)  0.27  1.81  0.72  0.88  30.3    0.71   6.48  85.14   1.76  >300000†

Table 4: No. of primers after the greedy algorithm and time
spent by it. Also final no. of primers after check for redundancy and the total
time spent solving the SCP. *Value from a 300 MHz Pentium II with 512 MB
RAM running Windows NT 4.0. †The computation was halted before
completion due to time constraints.



Dataset     DPA1     DPB1   DQA1    DQB1  DRB345    HLA-A    HLA-C
CFT           10       38     26      27      20       69       47
Time (s)   10.22  2748.92  60.80  372.56  427.32  4547.33  1091.37
Final         10       38     26      27      20       69       45
Total (s)  10.22  2749.14  60.86  372.61  427.38  4548.49  1111.70

Table 5: Results using modified algorithm CFT.



One reason CFT is not much better than the greedy algorithm
could be that it was designed for other instances of SCP. The SCP arising
in this application differs in three aspects from those: A) The density is much
higher, B) All rows are to be covered multiple times and C) The costs of all
columns are the same.

A comparison was made between the results from the greedy 
algorithm and from CFT in Table 6. Most of the primers (70% or more) 
were chosen by both algorithms, indicating that these primers are likely to 
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be part of an optimal solution. However, this is only an indication as the 
only way to prove this is to find an optimal solution. This will require far too 
much time even for the smallest data set as the problem is NP-hard. 



Dataset       DPA1   DPB1   DQA1   DQB1  DRB345  HLA-A  HLA-C
Greedy          11     41     30     29      21     72     47
CFT             10     38     26     27      20     69     48
Same             7     33     22     22      14     62     38
Percent (%)  70.00  86.84  84.62  81.48   70.00  89.86  80.85

Table 6: Comparison of primers from the two different
algorithms.
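
The percentages in Table 6 can be reproduced if the number of shared primers is divided by the size of the smaller of the two primer sets. That denominator is an inference from the printed figures, not something stated in the text, and the names below are illustrative.

```python
# Values as printed in Table 6.
GREEDY = {"DPA1": 11, "DPB1": 41, "DQA1": 30, "DQB1": 29,
          "DRB345": 21, "HLA-A": 72, "HLA-C": 47}
CFT = {"DPA1": 10, "DPB1": 38, "DQA1": 26, "DQB1": 27,
       "DRB345": 20, "HLA-A": 69, "HLA-C": 48}
SAME = {"DPA1": 7, "DPB1": 33, "DQA1": 22, "DQB1": 22,
        "DRB345": 14, "HLA-A": 62, "HLA-C": 38}


def overlap_percent(shared, size_a, size_b):
    """Shared primers as a percentage of the smaller set (assumed denominator)."""
    return round(100.0 * shared / min(size_a, size_b), 2)


PERCENT = {name: overlap_percent(SAME[name], GREEDY[name], CFT[name])
           for name in SAME}
```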

Results from combining HLA sequences in order to
differentiate between heterozygous individuals can be found in Table 7.
CFT was only used for the two smallest data sets due to the time
requirements. It performed slightly better than the greedy algorithm on those,
but only by one primer on each data set. There are heterozygotes that cannot
be distinguished from another heterozygote, which can be seen in
Table 7. This happens because the combination of two sequences to form
one heterozygote could result in exactly the same signal pattern as another
combination of homozygotes. In other words, some rows in the signal
matrix will be the same, leading to some rows in the binary SCP matrix not
containing any non-zero elements at all. For some of the pairs listed,
this is not true, however. They are listed because there were not enough
primers that have different signals for these pairs, and so they could not meet
the requirement of at least four different signals in the signal patterns
(Table 8). For the rest, it is simply a limitation of this technique for typing
HLA genes. To be able to identify the alleles forming each heterozygote,
primers that amplify alleles selectively should be used in the PCR step.
This will remove the ambiguities, as some heterozygotes will simply be
transformed to homozygotes since only one of the alleles in the
heterozygote will be amplified and not the other.

Dataset         DPA1     DPB1     DQA1    DQB1  DRB345   HLA-A    HLA-C
Greedy            26      130       51      81      94     172       94
Time (s)        0.99  9229.57     7.41  294.51  453.19  20826*  1212.59
CFT               25        -       50       -       -       -        -
Time (s)     1943.62        -  8427.82       -       -       -        -
Amb. het.          0       16        2       2       6      19        4
Percent (%)     0.00     0.58     1.31    0.34    0.95    0.45     0.35

Table 7: Results from heterozygous pairs. Number of primers
needed, the time spent, how many heterozygotes did not differ by at
least four signals from any other heterozygote and the percentage of the total
number of heterozygotes. *Value from a 300 MHz Pentium II with 512 MB
RAM running Windows NT 4.0.
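
The ambiguity between heterozygotes can be illustrated with a small sketch. It assumes that the signal a heterozygote gives for a primer is the union of the signals of its two alleles; this is an assumption made for the example, as are the function names and the toy allele names used in the usage note.

```python
from itertools import combinations, combinations_with_replacement


def genotype_patterns(allele_signals):
    """Combine allele signal patterns into genotype signal patterns.

    allele_signals: dict mapping allele name to a tuple with one string of
    bases per primer ('' when the primer gives no signal).
    """
    patterns = {}
    for a, b in combinations_with_replacement(sorted(allele_signals), 2):
        patterns[(a, b)] = tuple(
            frozenset(x) | frozenset(y)          # assumed: union of both alleles
            for x, y in zip(allele_signals[a], allele_signals[b])
        )
    return patterns


def ambiguous_pairs(patterns, min_diff=4):
    """List genotype pairs whose patterns differ in fewer than min_diff primers."""
    result = []
    for (g1, p1), (g2, p2) in combinations(patterns.items(), 2):
        diff = sum(1 for s, t in zip(p1, p2) if s != t)
        if diff < min_diff:
            result.append((g1, g2, diff))
    return result
```

With four toy alleles a1 = (A, C), a2 = (G, C), a3 = (A, T) and a4 = (G, T), the genotypes a1/a4 and a2/a3 both give the pattern ({A, G}, {C, T}) and so cannot be told apart, which is the situation listed as "No. diff. 0" in Table 8.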

Unfortunately, it was not possible to obtain any results for
heterozygotes for the data sets DRB1 and HLA-B, as these were too large
to run on existing machines. A very approximate extrapolation of the
primers needed for these data sets suggests that the total number of
primers for all HLA sets together would be <1000, which can be placed on one
chip without problem (one chip can contain up to ~5000 primers). Without
the reduction obtained above, at most two genes could be tested on each
chip. With the reduction, all nine HLA genes and the 16S rRNA gene can
be tested on one chip, with plenty of room to spare for other genes as
well. This makes APEX more versatile, as it allows a family of related
genes to be tested using only one chip instead of several.
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10 






Pilri QperOSOl DPBV2101 
PilT ) OP01*2I01 OPSriAOl 
No. «IM. 3 
P»1f 1 OPBrOSCl DPOf»»01 
No. tflll. 3 

Psir 1 OPSl'OSOl DPBl*a&01 
Pair 2 0P8 •.■1091 OPB f 210 1 
No. tflft. 1 

Pair 1 OPBI'0801 OPBIM40- 
P«lr3 OPBflOOi OP31*S7ai 
ho.dllf. C 

Ptirl DPBI'OBOI oppi*ioa< 
P»lr J DPai-lTJl OPBV4«0l 
K«. <ll«. 0 

Paul OPBI'C&OI DPBl'SSOt 
Pair 2 DPBt-aiOl DPBl'SSOl 
Ne.dltf- 0 

Pair t DPBl'OBOl DPei**SOi 
Pair 1 DPBI'IOOI OP6rt40t 
No.eitf. 0 

Pair 1 CPftfJOOt OP9t*5301 

PaHa CPBV4C01 DPer4«oi 



Pmtr \ 
Patr ] 

No. dllf. 

Pair 1 
Pair a 

He.din. 

«■ 

Piir 1 
Pair 2 

Ho.dlfl. 



DQAI-OIBt OQ*i-fl10S 



OaBl'OftO* D0BI*0fil2 
3QB1'OS09 DQflt'OSOa 
2 

OHS**C101-.ORB**01«11 
ORB4*01t)llO«B**050lH 
0 



Pair 1 ORBi'0101 1 OftB«*OI0) 
Pilr 2 DRS4'0103 onB4*MaiN 
Nfl. dllt. C 

Pair 1 GBe4'OJ01HORB4*020tH 
Pair a CRB 4*0301 NORB4*C301N 
NQ. «m. 0 



Palrl C-M2C5 

Pilr2 Cw-12042 
He. dfff. t 

Pair t C«*13042 

Paua C<r*120! 

NO. itfi. <■ 



CWM6Q2 



Pair 1 A'OIOt A*241lN 
Pab 2 A-01C4H A'HQZ 
Ho. dlK. 0 

Pair 1 A*020t A'02C5 
Pair a A-0202 A"O206 
No. dift. 1 

Pair 1 A*O201 A'OaM 
Pair 2 A*0>14 A*022] 

NO. din. 1 



A'OaQI 
A'0220 



A'2*0B 
A*241J 



Pain A-0201 
Pair 2 A-O20S 
NO. din. 0 

Palrl A*0201 
Pair a A'0211 
NO. din. 3 

Palf 1 A-020t 
Pair 1 A*0222 
N«. difr. 



Pair t A*C7C2 A'020B 
Pair I A*C2I4 A-0222 
Ma.dtlf. 0 

Pilf I A-02W A*2«01 
Pair a A'032? A'ZeOB 

No.flir'- 2 

Pal/ 1 A'a402 A'asoa 

Pair J A-2407 A-2501 
No. dIM. 0 

Palfl A'I402 A'fiSOia 
Pair 2 A'3407 A'BBO^I 
No. dift. 0 

Pair 1 A'250i A'BBOia 
Paira A'2S02 A*Bfl9il 

NQ. din. 0 
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Table 8: Heterozygous pairs that do not differ enough in their signal 
patterns, and how many signals they differ with. 



The results of this work are summarised in the following table.

Class I   Number of   Primers     Class II   Number of   Primers
          alleles     needed                 alleles     needed
HLA-A         91         172      DPA1           11          26
HLA-B        200       <1000      DPB1           74         130
HLA-C         47          94      DQA1           17          51
                                  DQB1           34          84
                                  DRB1          192       <1000
                                  DRB345         35          94

Table 9: Number of primers needed to discriminate between
heterozygote HLA samples.

Some sets of primers indicated in Table 9, and also the set
indicated for 16S rRNA, are set out in appendix 2.

Primers can be arranged on the surface of a support in such
a way that different studied types, genes, alleles, species etc. form easily
recognised characters such as figures or letters. These character-forming
primers can be additional primers of common origin from the gene of
interest and be used for validation of the process.

The following demonstration is based on the HLA Class II
DQB gene.

Experimental

Materials

Amplification:

DNA: Four cell lines homozygous for DQB, with alleles 0402, 0301, 06011
and 0201.

Primers: Primer DQB 9246 from Williams et al. -96 and DQB 96012 from the
Amersham Pharmacia Biotech HLA DQB typing kit, covering exon 2 and
generating a fragment of 300 base pairs.

Amplification reagents: PCR mix from the Amersham Pharmacia Biotech
HLA DQB typing kit, a prototype kit.

All amplifications were spiked with dUTP, to get a final concentration of 100
or 200 mM dUTP.

Enzymes for fragmentation of PCR products:
Shrimp alkaline phosphatase (SAP), 1 U/µl, APB.
Uracil-DNA-glycosylase (UDG; if from PE, UDG = UNG), 1 U/µl, NE Biolabs.

SAP will degrade (dephosphorylate) all free dNTPs and UDG
will remove all dU from the DNA, and after heating the strands will be
broken at these points. This step is applicable to any DNA fragment.

Primers for spotting:

All 84 primers for the 500 bp fragment were ordered from
LTI/GIBCO BRL Custom primers service. All were 25-mers with an
amino-activated 5'-end. For primer sequences see appendix 1. Self-extended
primers were N, A, C, G and T as controls with the following sequences:

N: amino TTT AGC CTT AAC GCC T N TGAC GTCA

A, C, G, T: amino TTT AGC CTT AAC GCC T X TGAC GTCA, where X is
A, C, G or T.

Extension reagents for the APEX reaction

Dyes: Specially synthesised for Baylor by Du Pont and/or APB

Cy2 - ddCTP (equal to fluorescein) 50 µM
Cy3 - ddATP 50 µM
Texas Red - ddGTP 50 µM
Cy5 - ddUTP (often written as T in many of the reactions and
results) 50 µM

10x ThermoSequenase™ DNA polymerase buffer (TS):
260 mM Tris-HCl pH 9.5; 65 mM MgCl2. ThermoSequenase DNA
polymerase (Amersham Pharmacia Biotech) 4 U/µl; if needed dilute with
T.S. dilution buffer (10 mM Tris-HCl pH 8.0; 1 mM β-mercaptoethanol;
0.5% Tween-20 (v/v); 0.5% Nonidet P-40 (v/v)). TS was used from a 150
unit stock and diluted 1 µl + 37 µl dilution buffer.

Methods

Preparation of glass slides before spotting of primer:

Arrange 25-30 cover slips (24 x 60 mm) in a stainless staining
tray.
Immerse the tray in a glass staining dish with acetone to fully
immerse the slides.
Place the glass staining dish in a sonicator for 10 minutes.
Remove the tray from the acetone bath, shake off excess
acetone and rinse several times (at least twice) in MilliQ water.
Immerse the tray in 100 mM NaOH and sonicate for 10 minutes
(a few more minutes is no problem).
Remove the tray, shake off excess NaOH and rinse
several times (at least twice) in MilliQ water.
Immerse the tray in silane solution and sonicate for 2 minutes.
Wash the slides by immersion in 100% EtOH once.
Dry the tray with the slides using nitrogen at a high velocity
(without breaking the slides).
Cure the slides in a vacuum oven at 100°C overnight or until
they are used for spotting (at least 20 minutes of vacuum is needed).

Spotting of oligos:

All spotting was done with a spotter with 96 parallel capacity.
Each slide was spotted with three replicas of the primers.
After spotting, the slides were allowed to air dry for 5 to 15
minutes; when dried they were marked. They were stored at room
temperature, in a dry place, in the trays until used.

PCR amplification

The DQB amplification was done according to the method
described by Williams et al. -96 using a 33% dUTP mix. After 40 cycles
(30 sec; 55°C, 30 sec; 72°C, 30 sec), one microliter of the PCR
products was tested on a 1.5% agarose gel before the fragmentation step.

Williams, Bassinger, Moehlenkamp, Wu, Montoya, Griffith,
McAuley, Goldman, Maurer: Strategy for distinguishing a new DQB1 allele
(DQB1*0611) from the closely related DQB1*0602 allele. Tissue Antigens,
1996, 48:143-147.

Fragmentation of PCR products:

Before APEX can be done all DNA fragments must be
fragmented so all new fragments can get access to the primer on the chip.

Set up:

5 µl DNA from a PCR reaction (1/10 of the PCR reaction)
2 µl SAP (Shrimp alkaline phosphatase) 1 U/µl APB
1 µl UDG (Uracil-DNA-glycosylase) 1 U/µl NE Biolabs
15 µl water
Total: 23 µl

Incubate at 37°C for 2 hours.

The samples were frozen and stored until they were used.

Inactivation of enzymes at 100°C for 10 minutes can be done,
but is not needed since this is the first step in the APEX reaction.



Extension method for the APEX reaction

Slide treatment:

Start with washing the slides in hot water (90 - 98°C, not
boiling) for 2 x 5 minutes in a 50 ml Falcon tube. When the slides are
ready, remove them from the tube with forceps and place them on a dry
heater block at 48°C. The slide (= DNA chip) is now ready for adding the
reactions.

APEX reactions set up:

23 µl DNA from the fragmentation step.
3 µl 10x TS reaction buffer (the rest of the buffer comes from PCR and
UDG cleavage)
1 7 for cover slip method.
Heat denature at 100°C for 7-10 minutes, target 8 minutes, not longer.
Spin the tube quickly and quickly add
1 µl ThermoSequenase DNA polymerase (4 U)
1 µl Dye-mix (50 µM of the four dideoxynucleotides A, C, G, and T,
separately dye labelled).

Then the reaction mix was physically spread out over the
primer array with the tip of a pipette. Incubate at 48°C until no trace of
solution is seen. This takes about 8 minutes.

Wash with hot water for 2 - 5 minutes, 2 times. Ready to
read on detection instrument.
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Detection 

The detection system is a total internal reflection fluorescence
(TIRF) system, where microscopic slides are placed on top of a prism with
oil on to link a laser beam into the glass slide. The system has light of five
different wavelengths from five different lasers to vary between. In this
experiment only four were used. To detect Cy2 a 488 nm laser was
used, for Cy3 a 532 nm laser, for Cy5 a 635 nm laser and for Texas Red a 670 nm
laser. Image-related software was based on Image Pro Plus
3.0.
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Results 

Amplification of HLA DQB alleles

The DNA from the four DQB homozygote cell lines was
amplified according to the protocol in Williams et al. -96 with two different
concentrations of dUTP. In addition to this, DNA from six different
heterozygotes was amplified. All amplifications worked well and the
expected 300 bp fragment was seen from all samples.

APEX reaction with DQB chip

Primer chips were washed and fragmented PCR products
were incubated on the chip according to the protocol. The image was
compared to the expected pattern. The recorded pattern was similar to but
somewhat different from the expected pattern. The reason for this is that the
set up was planned for a 500 bp fragment, but the actual fragment used
was a 300 bp PCR fragment.

Homozygous cell lines results

Figure 4 shows the results from a cell line homozygous for
the DQB 0204 allele. The pattern shown in the image is very close to
the expected results from exon 2.

In all reactions the control primers worked well and the four
dyes were used in the same frequencies. In the case with a 500 bp
fragment for DQB typing the primers for allele 0402 were placed in such a
way that they formed figures. In Figure 4, panel D, most signals are seen
forming a "2" from the 300 bp fragment, and the missing signals will be seen
when the large PCR fragment is used. This clearly shows that primers can
be placed in a clever way to form figures.

Heterozygous results

For the heterozygous test only one of the four dye reactions
worked. Some of the expected spots from the heterozygous sample were
not seen, but this is probably due to the fact that no control signals were
seen in the lower right hand corner, where the signals were weaker than in
other parts of the slide.

As this experiment shows, a limited number of primers can be
used for HLA typing and if they are placed in a clever way the interpretation
of the results is very simple. Both homozygous and heterozygous samples
can be correctly analysed with this method.

Continuation 

An algorithm was developed in order to select the minimum
number of primers needed to identify different genes using APEX. It was
applied to the following HLA genes: HLA-A, HLA-B, HLA-C, HLA-DPA1,
HLA-DPB1, HLA-DQA1, HLA-DQB1, HLA-DRB1 and HLA-DRB345. It was
also applied to the 16S rRNA gene. In the case of HLA-DQB1, the primers
have been shown to work as intended. As is, a few assumptions were
made (such as how many mismatches to allow between the primers
and the sample DNA) that need to be tested and possibly refined.

Another improvement that can be made is the following: As is,
the program works only with discrete signals, e.g. either there is a signal 'A'
or there is not, either there is a signal 'G' or there is not, and so on. A more
precise approach would be to predict how strong the signals will be for
each primer on each sequence. A rough estimate of the signal strength
should be possible given some thermodynamic data about the primers,
most notably their melting points. With this information, and knowing the
concentration of DNA in the sample among other things, the proportion of
primers on the chip that will actually react with the sample DNA should be
possible to estimate. It would thus allow a rough estimation of what
strength the different signals will have. It will not be very precise, and the
estimate might possibly be off by a factor of 2 or more, but it will still give
some information about what signals to expect from the chip.

Given the melting points of the primers, the temperature at
which the reaction on the chip is carried out could be optimised as well.
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Since the sequences are known, it is possible to estimate the melting point 
of any primer to any sequence when there are a few mismatches. This
could be done for all primers on all sequences, and a range of
temperatures calculated. The actual temperature to use could then be
chosen so as to be as close to optimal for as many primers on as many sequences
as possible, instead of, as now, a standard temperature.

Another possibility would be to try other heuristics to solve the
resulting SCP. Even though CFT does give better results than the greedy
algorithm, it is not by much. It could be that Lagrangian relaxation methods
really are not suitable for unicost problems, but the only way to find out is to
try heuristics based on other ideas. It might be possible to reduce the
binary SCP matrix as well, before applying any heuristic to it. Some rows
in the matrix could end up the same, in which case one of them could be
removed in order to reduce the number of rows and thus speed up
computation. No figures exist for how many rows might be the same, but it
could be worthwhile examining this possibility to reduce problem size.
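
The row reduction suggested above could be sketched as follows. Identical rows impose identical coverage constraints, so keeping one representative of each leaves the set of feasible solutions unchanged. The function name and the list-of-lists matrix layout are assumptions for this example.

```python
def unique_row_indices(matrix):
    """Return the indices of the first occurrence of each distinct row.

    matrix: list of rows, each row a list of 0/1 entries (one per column).
    """
    seen = set()
    kept = []
    for idx, row in enumerate(matrix):
        key = tuple(row)          # hashable representation of the row
        if key not in seen:
            seen.add(key)
            kept.append(idx)
    return kept
```

Hashing each row makes the pass linear in the size of the matrix, so the check itself is cheap compared with solving the SCP.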

The algorithm itself could be improved. The complexity of the
redundancy-check phase can be slightly reduced by having a vector
consisting of the sums of the rows in each node. For each child-node, the
column to be removed is then subtracted from this vector of sums. This
operation can be carried out in O(m), and the final complexity will then be
O(m x N(p, p)) instead. For the greedy algorithm, another possible
improvement is to check the primer set for redundancy each time a primer
is added. The complexity for the greedy algorithm will be the same, as
the check will take O(m x p) (i.e. same as each iteration in the greedy
algorithm) each time (with the improvement just mentioned). The check
could take longer, but that is unlikely as that would imply that one primer
could make several other primers redundant. The main advantage is, of
course, that no redundancy check with its rather high complexity is needed
afterwards.
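
The row-sum idea can be sketched as a small bookkeeping class. It keeps, for every row, the number of chosen columns covering it, so that testing whether a column is redundant only requires looking at the rows that column covers. This is an illustrative sketch, not the tree search of the original redundancy check; the class name, method names and the parameter e (required coverage per row) are assumptions.

```python
class CoverTracker:
    """Maintain per-row coverage counts for a growing set of chosen columns."""

    def __init__(self, columns, n_rows, e):
        self.columns = columns          # columns[j] = set of rows covered by column j
        self.e = e                      # each row must be covered at least e times
        self.row_sums = [0] * n_rows
        self.chosen = []

    def add(self, j):
        self.chosen.append(j)
        for i in self.columns[j]:
            self.row_sums[i] += 1

    def is_redundant(self, j):
        # Removable if every row it covers has coverage to spare.
        return all(self.row_sums[i] > self.e for i in self.columns[j])

    def prune(self):
        """Remove redundant columns, oldest first; return the removed indices."""
        removed = []
        for j in list(self.chosen):     # iterate over a copy while modifying
            if self.is_redundant(j):
                self.chosen.remove(j)
                for i in self.columns[j]:
                    self.row_sums[i] -= 1
                removed.append(j)
        return removed
```

Calling prune() after each add() corresponds to the suggestion of checking for redundancy every time a primer is added.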

The most serious problem is the sheer size of the problems. 
For the 16S rRNA data set, around 300 MB is required just to store 
all the primers and their signals. Add to that the fact that all primers need to 
be traversed once for every iteration of the greedy algorithm, and the result 
is that it takes quite some time as well. This also means that it is not 
even feasible to use more elaborate algorithms such as the CFT algorithm 
on the 16S rRNA data set, unless a much more powerful computer is 
available. On the other hand, algorithm CFT would probably benefit quite a 
lot from a parallel computer, since much of the computation could be carried out 
as vector operations. It should then be possible to spread 
the computations over several processors, thus reducing the time required. It 
would also reduce the memory requirements on each processor (although 
parallel computers tend to have enough memory to store all necessary data 
for this problem on each processor anyway). Even the greedy algorithm 
would benefit from a parallel computer, as each processor can be charged 
with the task of scoring only a subset of the primers. It is not as critical in this 
case, though, since the computation times are not very high when using the 
greedy algorithm. 
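
The idea of charging each processor with scoring only a subset of the primers can be sketched as below. This is an illustration under stated assumptions: the matrix is a list of 0/1 rows, the score of a primer is taken to be the number of still-uncovered rows it covers, and the names `score_chunk` and `best_primer_parallel` are hypothetical. Threads are used only to show the partitioning; in CPython a real speed-up for this kind of work would more likely come from separate processes or vectorised arithmetic.

```python
from concurrent.futures import ThreadPoolExecutor


def score_chunk(matrix, uncovered, columns):
    """Score each column in `columns` by how many uncovered rows it covers."""
    return [(sum(matrix[i][j] for i in uncovered), j) for j in columns]


def best_primer_parallel(matrix, uncovered, candidates, workers=4):
    """Split the candidates over `workers` tasks and return the best column."""
    candidates = list(candidates)
    uncovered = list(uncovered)
    # Round-robin split so every task gets a similar number of primers.
    chunks = [candidates[k::workers] for k in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda cols: score_chunk(matrix, uncovered, cols), chunks)
        scored = [pair for part in parts for pair in part]
    if not scored:
        return None
    # Highest score wins; ties go to the lowest column index.
    return max(scored, key=lambda pair: (pair[0], -pair[1]))[1]


rows = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
]
first_pick = best_primer_parallel(rows, range(6), [0, 1, 2], workers=2)
```

Each task only needs the columns it scores, which is also why the per-processor memory requirement drops when the primers are partitioned.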

As it stands, this method is only capable of identifying known gene 
variants. If applied to a sample with a previously unknown variant, it is very 
probable that this new variant will be falsely identified as one of the known 
variants. It would be very advantageous if the method could be 
augmented in some way to recognise this situation and give a warning if there 
could be an unknown variant in the sample. This could be done by giving a 
warning when the observed signal pattern differs from the signal patterns of 
all known variants, but this might not be enough. There is no guarantee 
that the new variant does not differ only at some position not affecting any of the 
existing primers, which would make the new variant 
indistinguishable from one of the known variants. Some other way is 
probably needed as well. 
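
The warning described above can be sketched as a simple pattern lookup. This is only an illustration: it assumes each signal pattern is a sequence of 0/1 values with one entry per primer, and the function name `match_signal` and the variant names are hypothetical.

```python
def match_signal(observed, known_patterns):
    """Compare an observed signal pattern with those of the known variants.

    Returns (matches, warn): `matches` lists the known variants whose
    pattern equals the observation, and `warn` is True when there is none,
    which hints at a possibly unknown variant.
    """
    observed = tuple(observed)
    matches = [name for name, pattern in known_patterns.items()
               if tuple(pattern) == observed]
    return matches, not matches


known = {
    "variant_a": (1, 0, 1, 1),   # hypothetical signal patterns
    "variant_b": (0, 1, 1, 0),
}
hit = match_signal((1, 0, 1, 1), known)
miss = match_signal((1, 1, 1, 1), known)
```

As the paragraph above points out, a match is not proof that the variant is known: a new variant that differs only outside the primer sites produces the same pattern and raises no warning.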
APPENDIX 1 

Primer sequences for DQB heterozygote typing 
Primers 'dqb1-1' to 'dqb1-8' placed in positions A3-A10. 
Primers 'dqb1-9' to 'dqb1-18' placed in positions B2-B11. 
Primers 'dqb1-19' to 'dqb1-30' placed in positions C1-C12. 
Primers 'dqb1-31' to 'dqb1-42' placed in positions D1-D12. 
Primers 'dqb1-43' to 'dqb1-54' placed in positions E1-E12. 
Primers 'dqb1-55' to 'dqb1-66' placed in positions F1-F12. 
Primers 'dqb1-67' to 'dqb1-76' placed in positions G2-G11. 
Primers 'dqb1-77' to 'dqb1-84' placed in positions H3-H10. 

dqb1-1 NH2 - TCC ATC ACA GGA GTC AGA AAG GGC T 
dqb1-2 NH2 - GTG TGC AGA CAC AAC TAC GAG CTG G 
dqb1-3 NH2 - GCG GTG ACG CTG CTG GGG CTG CCT G 
dqb1-4 NH2 - TAA TGA GGG GGG TGG ACA CGC C 
dqb1-5 NH2 - GCG GTG ACG CCG CTG GGG CCG CCT G 
dqb1-6 NH2 - GGA CAT CCT GGA GGA GGA CCG GGC G 
dqb1-7 NH2 - GTG GTG ACG CCG CTG GGG CCG CCj G 
dqb1-8 NH2 - TCC GTC AAA GGA GTC /«3A MG GGC T 
dqb1-9 NH2 - GAT GTA TCT GGT CAC ACC CCG CAC G 
dqb1-10 NH2 - CCG AGT ACT GGA ATA GCC AGA AGG A 
dqb1-11 NH2 - GAT GTG TCT GGT CAC ACC CCG G^C G 
dqb1-12 NH2 - GGG TGG ACA CAA CGC CGG CTG TCT C 
dqb1-13 NH2 - GGG TGG ACA CAA CGC CGG TTG TCT C 
dqb1-14 NH2 - CTT CTG GCT ATT CCA GTA CTC GGC G 
dqb1-15 NH2 - ?ic CGG GCG GTG ACG CTG CTG GGG C 
dqb1-16 NH2 - GCT TCG ACA GCG ACG TGG GGG TGT A 
dqb1-17 NH2 - GCT GTT CCA GTA CTC GGC GCT AGG C 
dqb1-18 NH2 - CTT CTG GCT GTT CCA GTA CTC ^^^^^ 
dqb1-19 NH2 - ACC GTG TCC AAC TCC GCC CGG GTC C 
dqb1-20 NH2 - CAC AAC GCC GGT TGT CTC CTC CTG G 
dqb1-21 NH2 - CTC CTC CTG GTC A^^ CCG CCA C 
dqb1-22 NH2 - CCA GGA TCT GGA AAG TCC AGT CAC C 
dqb1-23 NH2 - GAG CGC GTG CGT CTT GTA ACC AGA T 
dqb1-24 NH2 - GAC ATC CTG GAG AGG AAA CGG GCG G 
dqb1-25 NH2 - AGA GAC TCT CCC GAG GAT TTC GTG T 
dqb1-26 NH2 - TAG TTG TGT CTG CAC ACC CTG TCC A 
dqb1-27 NH2 - ACG TAC TCC TCT CGG TTA TAG ATG T 
dqb1-28 NH2 - GCT TCG ACA GCG ACG JpG AGG TGT A 
dqb1-29 NH2 - TCC GTC CCA TTG GTG MG TAG CAC A 
dqb1-30 NH2 - TGA TAA GGC CCA GCC CGA GGA AGA T 
dqb1-31 NH2 - GGG TGG ACA CAA CGC CAG TTG TCT C 
dqb1-32 NH2 - GGG TGG ACA CAA CGC CAG CTG TCT C 
dqb1-33 NH2 - GAC AGC GAC GTG GAG GTG TAC CGG G 
dqb1-34 NH2 - TCC GTC CCG TTG GTG MG TAG (y^C A 
dqb1-35 NH2 - GCA CGA CCT TGC AGC GGC GAC CCC A 
dqb1-36 NH2 - GAA CAG CCA GAA GGA AGT CCT GGA G 
dqb1-37 NH2 - CTT CTG GCT GTT CCA GTA CTC GGC A 
dqb1-38 NH2 - AAC GCC AGC TGT CTC TTC CTG GTC A 
dqb1-39 NH2 - GAG AGG ACC CGG GCG GAG TTG GAC A 
dqb1-40 NH2 - GCA GGC GGC CCC AGC GGC GT^^^^ 
dqb1-41 NH2 - GTC GCT GTC GAA GCG CAC GTC CTC C 
dqb1-42 NH2 - CTC TGT CCT GGA TGG GGT CGC CGC T 
dqb1-43 NH2 - ACG GGA CGG AGC GCG TGC GTT ATG T 
dqb1-44 NH2 - GAA GTA GCA CAT GCC CTT AM CTG G 
dqb1-45 NH2 - TCG GTG GAC ACC GTA TGC AGA CAC A 
dqb1-46 NH2 - GGA M CGT GTA CCA GTT TAA GGG C 
dqb1-47 NH2 - ACG TAC TCT TCT CGG TTA TAG ATG T 
dqb1-48 NH2 - GAG AGG ACC CGA GCG GAG TTG GAC A 
dqb1-49 NH2 - ACC CCA GCC TCC AGA GCC CCA TCA C 
dqb1-50 NH2 - CAA CGG GAC GGA GCG CGT GCG GGG T 
dqb1-51 NH2 - ACA TCT ATA ACC GAG AGG AGT ACG C 
dqb1-52 NH2 - GAA CAG CCA GAA GGA CAT CCT GGA G 
dqb1-53 NH2 - CCT TCT GGC TAT TCC AGT ACT CGG C 
dqb1-54 NH2 - TTA AGG CCA TGT GCT ACT TCA CCA A 
dqb1-55 NH2 - TTC AGA TTG AGC CCG CCA CTC CAC G 
dqb1-56 NH2 - ATC TGG TCA CAA GAC GCA CGC GCT C 
dqb1-57 NH2 - AGT AGC ACA GGC CCT TAA ACT GGT A 
dqb1-58 NH2 - ATG TAT CTG GTC ACA CCC CGC ACG A 
dqb1-59 NH2 - ATC TGG TCA CAT AAC GCA CGC GCT C 
dqb1-60 NH2 - ATC AAA GTC CAG TGG M CGG AAT G 
dqb1-61 NH2 - ACG TGG GGG TGT ATC GGG TGG TGA C 
dqb1-62 NH2 - ATC AAA GTC CGG TGG M CGG AAT G 
dqb1-63 NH2 - GTA TCT GGT CAC ACC CCG CAC GAG C 
dqb1-64 NH2 - CGC TGT CGA AGC GCA CGT CCT CCT C 
dqb1-65 NH2 - GGA M CGT GTT CCA GTT TAA GGG C 
dqb1-66 NH2 - TGT GGG CTC CAC TCT CCT CTG CAA G 
dqb1-67 NH2 - ACG TCC TCC TCT CGG TTA TAG ATG T 
dqb1-68 NH2 - TTG CAG CGG CGA CCC CAT CCA GGA C 
dqb1-69 NH2 - GAA GTA GCA CAG GCC CTT AAA CTG G 
dqb1-70 NH2 - GAA GTA GCA CAT GGC CTT AAA CTG G 
dqb1-71 NH2 - TCG ACA GCG ACG TGG GGG TGT ACC G 
dqb1-72 NH2 - TCG ACA GCG ACG TGG GGG AGT TCC G 
dqb1-73 NH2 - TGT GGG CTC CAC TCG CCG CTG CAA G 
dqb1-74 NH2 - CGG CGT CAG GCC GCC CCT GCG GGG T 
dqb1-75 NH2 - TCG ACA GCG ACG TGG AGG TGT ACC G 
dqb1-76 NH2 - CCG TTG GAG GCT TCG TGC TGG GGC T 
dqb1-77 NH2 - CGG TGA CCC CGC AGG GGC GGC CTG A 
dqb1-78 NH2 - ATG GGA CGG AGC GCG TGC GTT ATG T 
dqb1-79 NH2 - CGG TGA CGC CGC TGG GGC GGC TTG A 
dqb1-80 NH2 - ACG GGA CGG AGC GCG TGC GTC TTG T 
dqb1-81 NH2 - TGA TAA GGC CAA GCC CAA GGA AGA T 
dqb1-82 NH2 - GAG ACT CTC CCG AGG ATT TCG TGT A 
dqb1-83 NH2 - CGT CGC TGT CGA AGC GCA CGT CCT C 
dqb1-84 NH2 - GAC TCT CCC GAG GAT TTC GTG TAG C 
Homozygotes 



(From CFT if available, otherwise greedy algorithm). 



TTTGCCCAGGGCACAG 
TTTAAGGAAAAGGCTC 
TTTTGGATCTGGACAA 



TTTCTGGCCCAGCTCC 



TTTTTGTACAGACCCA 
TTTAGGGGACCCTGTG 
TTTGGCGGACCATGTG 



TTTCTGCTCATCTTCA 
TTTGTCAACTTATGCC 
TTTTCAGGCCGCCAAT 
DQA1 






TCAACCGGGAGGAG 
TGGCCTGACGAGGA 
TCAACCTGGAGGAG 
TTCCAGTACTCCTC 
TTGCCGTAACTGGT 
TTGGGGCGGCCTGA 
TGCGCGTACTCCTC 
GGACAGGAGGAA 
TCACAGGAGGAGCA 
TTTGCTCCTCCTGT 



TGGCAATGCCCGCT 

TGGCACTGCCCGCT 

TAGAGAATTACGTG 

TTCCAGAGAATTAC 

TAACTACGAGCTGG 

TGGTCATGGGCCCG 

TTGACCCTGCAGCG 

TTACACGTAATTCT 

TGTAACTGGTACAC 
TCTGACGAGGAGTA 
TTTACCTTTTCCAG 
TCCTGGAAAAGGTA 
TGAGAATTACCTTT 

GCCTGACGA6GAG 



TACTGGTGCACGTA 

nCCTCCAGGATGT 

TCGGGAGGAGCTCG 

TAGCCAGAAGGACA 

TCAGCCAGAAGGAC 

TAGTGCCGGACAGG 

TATTGCCGGACAGG 

TCCTGCAGCGCCGA 

TAGAGAATTACCTT 

TGGACTCGGCGCTG 

TACTACGAGCTGG6 

TGCTTCGTGCTGGG 

TGTCCCTGGTACAC 

TGCGCTGCAGGGTC 



TACATCCTCATCTG 

TACACCCTCATCTG 

TCAAGTTTACACCA 

TCAGCCACAATGTC 

TTCCAAGTCTCCCG 

TCGGGAGACTTGGA 

TAATTCATGGCTGT 

TACAATCCCAGGGC 

TACAACCCCAGGGC 



TGTGGGCATTGTGG 

TCCAACACCCTCAT 

TGGCCCACAGACAA 

TCATGGGCATTGTG 

TGGCCTGGATGAGC 

TAGGCTCATCCAGG 

TCAACACCCTCATT 

TAGCACTGGGGACT 

TAAGGGCCATTGTG 
T 1 1 I 1 I 1 1 I I I I AAATTCATGGGTG 
I Ml I I m I I I C ACCATAAGAGGC 
TTTTTTTTTTTTCACCACAAGAGGC 
I n 1 1 i I I M I I CACGGTAAGAGGC 
I I I I I I Ml 1 I H ICCTCCCrrCTG 
I I I 1 1 1 1 II i I I lA ACTCTCCTCAG 

I I 1 I I I II I III l - A AATCTCATCAG 

I I I I n I I I 1 1 I CTCCTCCCTTCTG 

DQB1 






10 ll llllinill ATCTTGCAGAGGA 
15 I I I M 1 11 H I r CCTCTCGAGGATG 

I 1 1 I I I H I 11 I GGGTCACCGCCCG 
I I I M I I I I I I I GGGAGTTCCGGGC 
rrTTTTTTTTTTCGCTCGGGTCCTC 
15 I 1 1 I I I I I 1 H ICCAGTACTCGGCG 
I n I II I I I H I CTGGGGCCGCCTG 
I I m I n I I I lATGTCTACACCTG 

I I I I I I I t I i I l AAAGGGCTTCTGC 
n I I li I I I I I l AGCATCACCAGGA 

20 I I I I t I n I I I i GCCAGGAGGAGAC 

I I 11 I I H I I I l ACCAGGAGGAGAC 
1 1 1 1 1 1 1 1 1 1 1 IGGTTTCGGAATGA 
II ill IM H I I GGGTGTATCGGGT 

nullum i gtcggaaagggct 

2*5 I i 1 i I I 1 I I I I I IGGTTTCGGAATG 

ii iii i i i i i i i ccagtactcggca 
ttttttttttttagcgcacgatctc 
i ii i i h i ii i i gtctcttcctggt 
I I I I I I I I I I I icgtcaagccgccc 
30 I I I I I I I n I I igcgtcaagccgcc 
30 I U U I I I I H ICAAGGTCGTGCGG 

I I I I I I I I II IT CGGTTATAGATGT 

1 1 1 1 II I m 1 1 i gtaaccagacac 

I I 1 1 1 1 U I 1 1 T GTATGCAGACACA 

35 min i mi i cacaccccgcacg 
m I I II m 1 1 acaccccgcacgc 

35 DRB1 

llllll imil GCAAGTCCTCCTC 

[ I ll in i um i ctcctcccggt 

40 n II I I I I m I CCACAACCCGGTA 
M l in III I I I GGCCAGGTGGACA 
7 Ml I I II M l I GCGGTTCCTGGAG 
40 T M M I I I I I I rCAGCCAGAAGGAC 

III M II I M I I GACTCGCCTCTGC 
4 5 II I I m n Ml IC CAGGACTCGGC 

I I M I M Ml M GAAATAACACTCA 
M l M l I I I I M I GGAGGACAGGCG 
M I I M M l M l ACGTGGTCGGGTG 
n M II M I III l A CTCCAAGAAAC 

^5 50 m I n n 1 11 i a cggtgtccacct 

I I M l I M I III GGAGAGGTTTACA 

ttttttttttttccagtactcggca 
m i i i m n m i ggagtactctacg 

M n IM I I M I GTGTAAACCTCTG 
55 I I I 1 I I M I II I CGGTGCAGCGGCG 
II I Ml I M II r GGAGGAGTTCCTG 
M 1 1 1 I I I I I I I I GGAAGACGAGGG 
CAGGAGGTTGTGG 



I I 1 1 II 1 1 i n I GACAGGCGCGCCG 
I I I III I U I I ICCGTTCAGGAACC 
IIIIIIIIIII IGGAATCCTCrrGG 

I I 1 1 1 1 1 m 1 1 GCCACAAGAAACG 
I 1 1 1 1 1 I 1 1 1 1 lACGTTTCTTGGAG 
I 1 1 I 1 1 I 1 1 1 1 1 CGGACTCCTCTTG 

I I I 1 1 1 1 1 1 I I I l ACGGGTGAGTGT 
rCCAGGAGGAGTTC 
rGTAATTGTCCACC 

mini mil I CGTAGCGCGCGT 
III III I I III lAAGATGCATCTAT 
I II I II I I I I H lA CGTCTGAGTGT 
I I II H I I II I ICCAGTACTCAGCA 
111 M l 1 1 1 I I ICGTAGCGCGCGTA 
I I I I I 1 1 1 I II l ATCTCTCCACAAC 
I 111 I I I I 111 I GAGCTCCTCCTGG 

I III I I I 1 1 1 I l AACCAGGAGGAGT 
111 1 111 1 I 1 1 lAGGGCCCGCCTGT 
III I m 1 I I 1 IGGAGAGCTTCACA 
III m III I I IGGA6AGATTCACA 

I I III III 1 1 1 I I CACCGCCCGGTA 

III III I III I l AACTACCGGGrrG 
III I III I 1 I 1 ICCAGTACTGGGCA 

DRB345 

II 111 m I m GTATCTGTCCAGG 

I I I I I I I 1 1 1 1 I GACTGGGGTGGTG 
TCTGTCGAAGCGCA 
rCTGTAAACCTCTC 
rCTGTGAAGCTCTC 

1 I I I 1 1 1 1 1 1 1 I CACCAGGGCCCGC 
I III 1 1 III III GGCCAGGTGGACA 
m 1 1 1 I III 1 IGCGGTTCCTGGAG 

III I I III I I 1 1 ICGAAGCGCGCGT 
1 1 1 I 1 1 1 III 1 1 lA ACCAGGAGGAG 
m il l 1 III 1 l ACGTGGTCGGGTG 
1 I 11 1 1 1 I I I 1 lAGGGCCCGCCTGT 
1 III I III 1 1 1 IGGGCCCGCCTGTC 
111 I m I I I 1 l AACTACGGAGTTG 

I III 111 I 111 IGGGGCCGGGCTGT 

imimiiiiGACCATGTTTcrr 

m i l i um I CTGTGCAGGAACC 
li m i l l im GGCCGGGCTGTTC 

I I 111 1 1 1 1 1 1 lACATCCTGGAAGA 

I I I I m I I I I I CTCACGAGTCCTG 

HLA-A 



mimmm CAGTCTGTGAGT 
ill ill I I 1 1 I lAGACGCATATGAC 
I 1 I i 1 1 I I 1 I I IGGACGCATATGAC 
GGTCGCCAGGTCC 
CCGCAGGCTCTCT 
CCTCCTCCACAT 
CCGAACCCTCGTC 
TTATTTCTCCACATC 
TTGGCGGACATGGCG 
TTCCAGAGCGAGGAC 
TTTTCACCACATCCG 
TTGGGAGCCTGCCCA 
TTTGATGTGGAGGAG 
III I I I I It 1 1 r GGAGGAGGAACAG 
I HI t II m TT A GTCATATGCGTC 
lltHnillH GGTCTGCCCGAGC 
I I 1 1 I I I I II I rA AACCTGCCATGT 
5 M U I I I II I I TCCGGGACACGGAA 
1 I [ II I I I I I T TCGTCCTGGGGGGG 
TTTTTTTTTTTTCCGCTGCCAGGTC 
H I I I I I I M I l ATGCGTCCTGGGG 
I I 1 1 I I I t I 1 1 r ATGCGTCTTGGGG 
10 II I II t I m I T GGAGAAGAGATAC 
M I I I I I II H I GGGAGCCCGCCCA 
Tl n I H 1 1 1 1 1 CCGCAGGTTGTCT 
TTTTTTTTTTTTGCGCAGGTCCTCT 
7 I I I I I I I I 1 1 I GGGCGGGCTCTCA 
15 TTTTTTTTTTTTCCA6GACACGGAG 
I I I 1 I I II H t I CCGGCAGTGGAGA 
I II I HI I H I l AGGAGACAGGGAA 
l in il l m i l GTCAATCTGTGAG 
llimi l lll l A GAAGTGGGTGGC 
20 I I n I I 1 1 I I I I CAGGTAGGCTCTC 
I 1 1 I n I M I n CGGACGCCCCCAA 
I t I H I 11 I 1 I I I CAATCTGTGAGT 
U I I I I H II i I I GAAGGCCCAGTC 
T n 1 1 I I I I I I I CGTCGTAAGCGTC 
25 I I H 1 1 I I I I I l AACGAGAGCGAGG 
TTTTTTTTTTTTTGACGGTCATGGC 
in i l ll ll l lN GGACCTGGCGAC 
TTTTTTTTTTTTGAGAGCCCGCCCA 
TTTTTTTTTTTTTCATATTCCGTGT 
30 TTTTTTTTTTTTGGGAGACACGGAA 
l l li nil ll ll GTCGACTCGGTCA 
II IIII I I I UI CCGTGTCTCCCCG 
IHnill l IH GCTGCCACGTGGG 
n I I I I I t I I I I CGAACTGCGTGTC 
35 I I I I I I I I I I H GGTAGGCTCTCAA 
TTTTTTTTTTTTAGGTCCACTCGGT 
U 1 1 I H HI I i GTCCTGGGGGGGT 
II I I II I IIHI GCTGCTCCGCCGC 
I I I I II n H I I GGGGCGCCATGAC 
40 n I I I I 1 1 I IT TGCGCGATCCGCAG 
T I I I I I n H I IGCACATGGCAGGT 
1 11 1 n I I 1 1 1 T A GGAGAAGAGATA 
n I I I I i I H IT AGGAGCAGAGATA 

I I I I 1 1 II H I r CCACTCCACGCAC 
45 I U I I I I I I I I r CCCGTCCACGCAC 

T H 1 1 I I I III I CACGTGCCATCCA 
H 1 1 I m H I TCCCGGCCCGGCAG 
TTTT7TTTTTTTCACGTCGCAGCCA 
TTTTTTTTTTTTACGTCGCAGCCAT 
50 mill UIIIIA CGTGGCAGCCAT 
TTTTTTTTTTTTATCCAGAGGATGT 

I I i n I U 1 1 1 I CGAGCTCCGTGTC 
i llllllllll lACCAGAGCGAGGA 
1 1 1 1 1 1 I 1 1 1 ITATGAACAGCACGC 

55 I I M H I I II M rCACACCCTCCAG 
I I I I I I I I II I T CTACGTGGACAAC 
HLA-B 

I 1 1 I I I I I n I IGGATGGCGCCCCG 
I I 1 1 I t H H I I CGGCTCAGATCTC 

I I I I 11 I H I H IC GGGGCGCCGTG 
5 11 1 I i U I I n I CTCCACTGGTCCG 

TTTTTTTTTTTTTGTGTTGGTCTTG 
m I H i II 1 1 t GGGTATGACCAGT 

I I i 1 1 1 I II II I ICCAGGTGATGTA 
TTTTTTTTTTTTGTCGTGCTCCGCC 

10 111 II I H I I I I I GTAGTAGCGGAG 
TT I I I I 1 1 I 1 r TGCTCAGGTCCTCC 
TTTTTTTTTTTTACCAACACACA6A 
llll l l it i n iCCGTCGTAGGCGT 
n 1 1 I I I I I I I IGTGAGCGTGCGGA 

15 I I 1 1 I I 1 I I I I lACATCATCCAGAG 
I 1 1 I I I H I I I r GGTTCTCTCGGTA 
t I H H i I I I I n GATGTGTCTCTC 
I 1 1 H I I I I II I GCGCCATGACCAG 
I I I I U I H I I I GGCGTCCTGGTCA 

20 M n n I I I 1 1 l AGGAGGAGCTGAG 
TTTTTTTTTTTTGCGCCAGGCACAG 
1 1 1 1 I I I I I i I l AGGAGGGGCCGGA 

III I m I 11 I i CCGCTGCTCCGCC 
I I 1 I i I I I ill l A CACCATCGAGAG 

25 U 1 1 I 1 I I I II ICACACAGATCTAC 
I I 1 1 I II I II I IGGGCATGACCAGT 
I 1 1 I I I i 1 1 1 I I CACACAGATCTCC 

I I I I i n I I I I IGCGAGTGCGTGGA 

II II It I II I I I IGGTACCCGCGGA 
30 I H I I I H 1 1 I I CCTGTGCGTGGAG 

II II I n I I 11 lAGACACAGATCTT 
1 I 1 I M 1 1 II I I CAGCGACGCCACG 
1 I 1 1 1 I I I I I I I CGGGCCGGGACAC 
l l limnill CCCGTCCCAATAC 
35 1 I II n I I I I I I GGGCATAACCAGT 
I I I H 1 I I 1 1 I I GCCCCGCTTCATC 

I I 1 1 I I I ill I I CAGGAGCGCAGGT 
H I n I I 1 1 I I I CGTCCACGCACAG 

II I I I II 1 1 I I IGAGTCCGAGAGAG 
40 I I I I I I I I I i I I GACACAGATCTCC 

I I I II UHm iAACCAGTTAGCC 
i mm i l l l llAGGCGTGCTGGT 

I I I 1 I I 1 1 1 I I I GACCCTGCTCCGC 
I II i 1 11 1 I I I IGGGGCTCCGCAGA 

43 II I 1 11 ni l I I CCGGTCGCAATAC 
1 11 I I I II I I 1 1 GCGGGTCACGGCG 
m I H I 1 I I I l A GGGCCAGGGCTC 
111 I I 1 1 1 1 H T ATCCTCTGGAGGG 

I I I I I I HI I H GGCAGACGATGTA 
50 I 1 I I 1 I 1 1 1 I 1 lAGGCGGAGCAGGA 

1 1 1 1 1 1 1 1 H I I CAGCTGCTCCGCC 
I 1 1 I I I I 1 1 I I l ATCTGCGGAGCCA 
I U I 11 1 1 I I I I CGGAGCTGTGGTC 

I H I II 1 1 1 II I CGACCACAGCTCC 
55 I I 1 11 I 1 1 I 1 1 I GAAGAGTTCAGGT 

II I I II 1 1 I 1 1 1 CATGTCGCAGCCA 
11 1 1 1 I I n I H CTGGGCTGGCTCC 

I I I I I I 1 1 t 11 I CAACACACAGACT 
H l l l ini l HI GGCGGAGCAGGA 
TTTTTTTTTTTTTATGACCAGGACG 
1 n M 111 I U I CCACTGCTCCGCC 
I H I t i II I II l ATGACCAGGACGC 
M 1 1 1 1 1 1 1 1 1 I GGAGGGGCCGGAG 
5 1 1 1 1 1 1 1 1 1 1 1 1 I GCGTGGACGGGC 
n I i I n I I I i lA GATCTGTATCTC 
n I I I I 1 1 1 II rG CGGGTCATGGCG 
T II 1 1 1 1 1 I 1 1 r CCGGGACATGGCG 
T M I I I I M I I I CCACAGCTGTCCA 
10 7 1 1 1 I I I I I i I r CGGGACATGGCGG 
1 1I I 1 1 1 I I I I I CCCGTCCACGCAC 
T I 1 1 1 1 I I I I I I GAAGTGGGAGCCG 
TIl llil l llllll CCCAATCCACC 
! I I 1 1 I I I I I rr CCCACGATGGGGA 
15 Ullll ll tl l lU CCCAGTCCACC 
I I i 1 1 I I I I I I f GAGATCTGAGCCG 
TII I IIII II I II CCACGCACTCGC 
I I I III I I I I I r GACAGCGACGCGA 
I 1 1 1 1 11 1 I I IT CGCCGCGGACACC 
20 1 1 I I H I I I I 1 1 GTAGGAGGAAGAG 
n i llUlllllU III I CCACCTGA 
niniltll l l CACGTCGCAGCCA 
I I I I I I It I I I I CAGGTCGCAGCCA 
niUI II IIII CGTAGCCCACTGC 
25 TTTTTTTTTTTTATCCAGGTGATGT 
TTTTTTTTTTTTTCCCAATCCACCG 
I I I HI i I I I 1 1 GGGCGCTTCCTCC 
TTTTTTTTTTTTCCCGCTTCATCGC 

I I H I I I 1 1 I I I CCCCGCTTCATCG 
30 I lill l lllll TCACACAGACTTAC 

T 1 1 1 I I I II I r r A GGACGGTTCGGG 
H II IIII I H ICCCCGAACGGTCC 
Tl I I I I I 1 1 I I rGAGCTCTTCCTCC 
n I I H I I I I I I GCTCCCGAGAGCA 
3^ 1 H I I I I I I I I l ACTCCATGAGGCA 
^ T il I I I I I m i GGTGTGGTGGTGC 
H I 1 1 I I 1 1 1 I n I GTCCAGAAGGC 

I I I H I I 1 1 I I I rGCCCGCGGAGGA 
n I I I I I 11 I I f GCCGCGGACAAGG 

40 M I I I n 1 1 I I I CCGCCTrGTCCGC 
H I i I I I I I I I I CGGGTACCACCAG 

HLA-C 

TTTTTTTTTTTTTGAGCTGGGAGCC 
I M n I I 1 1 1 1 I GGTGGAGGGCTCC 
45 1 I I I I I 1 1 I H I GGGTGCAGGGCTC 
T I I I I I I I I I I I GAGGCGGAGCAGC 
n I 1 1 1 1 1 1 1 I lA CGGCGGAGCAGC 
TH I I I I I I I I I GCGGCGGAGCAGC 
T j 1 1 I I il I 1 1 rA GGGCGCGGAACC 
50 I I I m 1 I I 1 1 I CGGCCCAGGTCTC 
1 I I I H 1 1 I I 1 1 I GGCTCCCAGCTC 
TTTTTTTTTTTTGCGCGCGGAACCC 
1 1 1 n 1 U I 1 1 lA CGGCTTCCATCT 
rnTTTTTTTTTGGTTCGGGGCTCC 
55 I I M I I I 1 1 1 1 l A CTCCACGCACAG 
1 I 1 1 I I I I I 1 1 I l -GGAGCAGGAGGG 
I I I I I I H I I I I GCGCGCAGAACCC 
IIHIIIIimi GAGTCTCTCATC 
l I llllllltll CCTGCAGCCCCTC 
TTTTTTTTTTTTCCGCCGTGTCCGC 

I I I H I i 1 1 1 1 I CCGCTGTGTCCGC 

I n n n 1 1 1 1 1 i ccagaatatgta 

n I m i I I I I I CGGGGAGCCCCGC 
5 I i I I 1 1 1 1 i I 1 I GCCGTCGTAGGCG 

I II I I I 1 1 I I I I CCGCCAGGCACAG 
1 1 1 I m I I I I I GCGCCAGGCACAG 
1 1 1 1 1 1 1 i m I GTAGCCGCGCAGG 
I II I III m I I GCTGGACGCAGCC 

1 0 I I m I I I I H I I C CAGTGGATGTA 



I n n I m 1 1 1 gccgtgtccggag 

I II I 1 1 1 1 I I I IG AGGGGAGCCCCG 
I I I I I I I III I r CGTGTCCCGGCCT 

1 5 H I I 11 I III I I GGCATGACCAGTT 
1 1 1 I I M i I II I GGTATGACCAGTT 
I H H 1 1 I I I I I GACAACCAGGACA 
I I n I I I t III I GAATATGTATGGC 
I 1 n i 1 II I U I GACAGCCAGGACA 

20 I 1 1 1 H I m i I CTGGCTGTCCTGG 



I I 1 1 I I II I II l AGGGCCAGGGCTC 
I 1 1 1 I I I I i I II l A TAACCAGTTCG 
I n I H II I II I CATAGGAGGAAGA 
25 I H 1 1 1 I 11 I I I I GTGGAGACCAGG 
I I I III 1 1 I I I I I GCTCrrCTCCAG 
I I i I I I I I 1 1 I I GAAGAATGGGAAG 
I H I t il I I I II I GCGGAAACTGCG 

16S rRNA 

30 n 1 1 1 1 I 1 1 1 1 1 lA GCCGCCTGCGT 

I II I I II n I TrGGCCGCAAGGCTG 
n I It I n 1 1 1 I GAACTGCCGTTGA 

I I i 11 1 1 11 1 1 lA GACTGCCGCTGA 

I 1 1 I 1 1 I I 1 1 1 I I lATTCGGAATTA 
35 lllllllllllirrGCACCCCTTGT 

I I n H 1 1 i II ICGCGAGGTTGAGC 

III I I I I I I I I r TACCCCCCATTGT 
n I I I I I I I I I r CATTTGATACTGG 

I I 1 1 II I 1 1 n I 'GTGTGCCTAATAC 
40 n 1 I I I I I I I I I l ACGACTTAACCC 

I I I II 1 I I II I ICCCGGCCTTTGTA 
1 11 1 1 1 I n n I GGGCAAACTGGAG 

I I I n I 1 1 II I I GATTTGATCCTGG 

I 1 1 I 1 1 I li III I 'GACTCCCGAAGG 
45 I I I I I I 1 I 11 1 IGAAGTCGTAGCAA 
1 1 1 II I I I n I ICGCTGCAGAGATG 
II Ht ll HiniA CCCTACCTACT 
1 M I I I 1 1 U I I GAGGACCTTCGGG 
I n 1 1 I I I I I 1 1 A AGGGCCATTACC 
50 I I I I I I 1 I I I 1 1 GATAAACGCTGGC 
I I I I I I I il I I I GACTAGCTACTCC 
I 1 I n I 1 I I I I lA CATCCGGTCrrA 
1 I 1 1 1 1 I I 1 I I lATCGCAGGCCTTG 
I III I I I I I I I 1 I CACCAAGTCGCT 
55 I 1 1 1 I I 1 1 I 1 1 I ICCCTCCTTTCGG 
I I I I 1 1 I I I I I I I I TA AACGCTGGC 

I I II H I M I I IC GAAACCGCAAGG 
MI II I I I III I GCAAGCGTCCTCC 

I I I I I I I 1 1 1 I lACCAAGGACGTTT 




'CTCCTAGGACAGC 



. XTAATACCCGGAG 
TTACTTTCAGTGGGG 
. . CTGCGTGAAGTCG 
TTAATAGCCCACCAA 
TTAACGGAAACGGGG 
GGATTGCACTCTG 
TTTAGCCTTGGGGAG 
TTCGCCGCATGGCTG 
TTGCATAAGGGGCAT 
TTTACCACATCTCTG 
GTTACCGCGAGGA 
TTGGCTrrCAGAGAT 
TTCGCTGCTTCGCTG 
TTTAGCGCTACCTTG 
TTGCACCACCTGTCA 
TTTGAGTTTTAACCT 
. 'CTAATACGGGATA 
TTAGGAGAAAGCTTG 
rAAGAGATTAGC 
TTGTAGCATTGTGAT 
TTAGGCTTTCCCCCA 
TTAGAAGTAGCTTGC 
'CGCGTATCATCG 
CAGAGATTAGG 
mCCGAAAGCGTGG 
TTTACAACCCGAAGC 
rQTGATGGCTCAG 
TTCGTAGGCTTGGTG 
■GTGGAATTCCACG 
TTACGGTTCCCGAAG 
TTAACTCGAGTGCGT 
TTTGATGTGCTATTA 
TTAAGCAGGGAGGAA 
CTGCTGCAGTGAA 
TTrTGGGATTAGCTC 
TTCCTTTGATACTGG 
. 'GGACGCTAGCGGC 
TTGTTTACTACCCAC 
TTCGCGATCTCTAGC 
TTTAGGCCGTTCCCC 



TTACGCGTTGCATCG 
'GCCCGTCAAGCCA 
TTAGTCCCCGCCATT 
CTAGCCGTAAGGG 
'GTCCTTCGGGGG 
TTAACCAACTCCCAT 
TTACTGTGGGTAATA 
'CTGAAAGAT6GCG 
TTCGAAAGCCAGGGG 
TTGTCCGGAATTCTG 
TTCAGAAGTGGGTAG 
TTTCAGTCCTCATGG 
TTGAAAGAAGCTTGC 
TTGACCACCTGTCAC 
'GGAACTGCAT 
TTACAGTTCCCGAAG 
■CTCATATCTCTAC 
TTTTCAGTGAGGAAG 
TTACTGTGAGGAAGG 
rCCCAGCCCGTAAG 
11 I I 1 1 I I H I I CGTAGCCTTGGTG 
i I I I 1 1 I I 1 1 1 lA TGATGCGTAGCC 
•I III I II I I II IA GGCAGTGGCTCA 






CAGGACTTAACCC 

1 1 1 I m I 1 1 I I GGCCAGGCCGTAA 

I I 1 1 1 1 1 1 1 H I ICCAACTTCGTGC 

III I I I I I 1 1 I IGAAGCGTGTGTGA 
1 1 1 1 1 1 I I n I I CTCCCCCGAAGGT 
m I II I 1 1 I I l ATGGGAGTTTGTT 

1 0 1 1 1 I 1 1 I I I 1 1 1 GTGTGCCGTTACC 

I m 1 1 I I 1 1 1 l AGCAGTGAGGAAT 
15 1 1 1 I I n I I 11 IGCCCCGGTTAACT 

I I I I I I 1 1 I 1 1 IGCACCGGCAGTCA 
1 1 I 1 1 i I I 1 1 1 I GGACCTTCCTCTC 

15 TTTTTTTTTTTTACCTAGGTGGGAT 
I n I II I I I I I lA ATAGCTAATACC 
1 1 1 11 I n I I U GCCATATCTCTAC 
1 1 1 1 1 1 1 I II I IGCCGGTGGGGTAA 

2^ 1 1 It 1 11 II U I lA CCCCACCTTCG 

20 11 11 U I I II I I CAAGGCCTGGGAA 
M 1 1 I I I I 1 1 1 ICAACCCTGGTGGC 
I I I I I H I I 1 1 I CTAGTCATCCAGT 
I I II I I II I I I IGGCTGCTGCCTCC 
I I 1 I I II I H I I CCCAGAGCTCAAC 

25 25 1 I I II I II I I 1 IGAAAGCTTGATCC 

I I I I 1 I II I I I lA ACACGCTGGCAA 
1 1 1 1 1 I II I n i GAGCTTGCTCCCC 

II II I n 1 1 11 lA TTTAGTTGAGCA 
1 1 I i I I I 1 I 1 1 1 CGACTTAGGCTCA 

30 1 II I I 1 1 I I I U I IGATGTGCTATT 
II I I I 1 I I I I I I C TTAGGTGCCAGC 
30 I I I I I I 1 m I I GGCTACAGATCGT 

I II I 1 1 1 111 I l A ACTTGCGTGCAT 

I I I I I I I I I I I I GCGATTACGTCAA 
35 1 1 I I I I I 1 1 II IGGACGTTGGCGGC 

1 1 1 1 I II I 1 1 1 I IGGTGGAGCATGT 

I 1 1 I 1 I I I I 1 I lATAAACCATGCGG 
ll ll l lll l ll lAAGAAGTGGGTAG 

I I I I I I 1 1 1 i I l AACAAGCTAATCC 
40 T I I I 1 1 I 11 1 1 I ICCATGGTTTGAC 

- l l l l l tlimi AGTAACTGCCGGT 
1 1 I I 1 1 I 1 1 1 I tCAAAAGGGGGCGT 
II IIIIIII I II GGCGCTTGCGCTC 
TTTTTTTTTTTTGCTACCTACGTGC 
40 45 I I I II I I I 1 1 II IGCGAGGTGGAGC 

1 1 I I I I 1 1 11 I I CGCGAGGTGGAGC 
I II n 1 1 11 1 1 IGCTACCTACTTCT 
I I I I I I I II I 1 I I lAACACATACAA 

I I I I 1 1 nil II I G TTGTGAAATGT 
50 l l limm i l CGTAAAACTCAAA 

I I I I I H I I i I I I CAAGGGGCAAGT 
45 l l ll limi ll l CCAACCTTGCGG 

I I 1 1 1 1 1 i 1 i I IGGAGGAACGTGGG 
llll llimHA TAAGCCTCTCAG 

55 H 1 1 1 I I I I I M lA TGCTAATCCCA 
1 I I 1 1 I I II I I IGATGCTAATCCCA 
I I 1 1 1 I I I I I I IGCCAGTGTTCGTC 
I I II I I 1 1 1 I 1 1 GTAAAGGTGGGGA 
1 1 1 1 I II 1 1 H I I lA ACACACCGCC 

60 I I I I 1 1 I { I I I ICCAAGGCGGTGAT 
TTTTTTTTTTTTGCTACGGCTAACT 
TTTTTTTTTTTTAGTCGAGCACTCT 

I I 1 1 I I I I I 1 1 lAAGGGTAGCTAAT 

I I 1 1 1 1 I n H i G TGACAGTACGAG 
10 5 TTTTTTTTTTTTTGAAAGCACTTTA 

I I I I m I III I G GCGCAAGGCTTA 
III I I I IIII II GCCTAGGTGGGAT 

II i I II M m I GTCCCCACGTTCC 
II I I II 1 1 I I I I GGCCACAAGGGGA 

10 M I I I 1 1 I U I I CTAGCTGTAGGGA 
1 1 I II I I M I I I GTGGGCAGCAAGC 
^5 n I H 1 1 1 1 I 1 1 ICGAAAGATTAAA 

I I i l l II I I I I I GGAGTATGGTCGC 

I I I I I 1 M I I I ICGAGATGTGAAAG 
15 I I I I 1 1 I 1 1 1 1 I GGGCAGGCTAGAG 

II I I H I 1 1 n l ACCTCCTGAGCCA 
l lll l l l l llill CCACCGCTACAC 
20 nil mini IIIICAGTC7TGCG 

I I 1 1 1 1 I 1 1 M I CTTGACGGGCGGT 
20 I I I I 1 1 I I I I I lACGGTAAAAGATG 

M 1 1 1 1 1 1 1 1 1 1 I I CACCCTTGCGG 
Ml U 1 I I I I I I [A ACCAGAAAGCC 
I n II I I 1 1 I I f CAACCAGAAAGCC 

II I II 1 1 I I I I I GTGTCAAAGGCAG 
25 25 I III I I I I I I I I TA AGTCCGGATTG 

I II H I I 1 1 1 1 I GCGACATGCTGAT 
t I I I I I m 1 1 lA TCAGCCTGCCGC 
II I H I I I i I I I GTCGGTAGGGTAA 
I H I I I I I I I I I GTCGGTGGGGTAA 
30 I I I I II I I III ICAACTCATAAGGG 

I I I 1 1 I I 1 1 I I I I ICACTGCTTAAA 
30 millimil CGCCAGTCCCACC 

li milll l l lCTAGTCATAAGGG 
M I I I I I I I II I CACTGATrTGACG 
35 II 1 1 II I 1 1 I I IGGCCACACAGGGA 
l ll i m i lllllllC CCCCATTGT 
M 1 1 I I I I I 1 1 TT GACCAGAAAGGG 
35 I I II 11 I I I t I lACACTGGGGGATA 

l l ll ll l ll i m CAGCCGCCTTCG 
40 Ml I I I I I I I I I GTCGCCAGCTCGT 

I I I I I I I I I I I ICTCATATGAATTG 

I M M M m IN GTAAAGGGAGCG 

II II I I I I I I I I CGTAAAGGGAGCG 
n 1 1 II II 1 1 1 I GGCGGCTCCCTCC 

40 45 n II I I I I I I 1 1 CAGATGTTCCTCC 

n 1 1 I I I I I I I I GTCTCACGACACG 

mi i i i nm iCAGCCGccrACG 

M i m ill imi GTGCTAATACC 
I II I II M I I I IC TTGGAACTGCAT 
50 I M l 1 1 1 1 I I I lA GTACTCACCCGT 
M II I I 1 1 I H rA TTGCTCCATCAG 

I M II I I I I m IGATCCTGAGCCA 

I I I II I I I I I I lA GCAAGTAGAACG 
TTTTTTTTTTTTTGCAAQTAGAACG 

55 I I I I I 1 1 I 1 1 1 1 GATAACC GCAA GG 
M M I H I III r GCAAGCGT TTTCC 
I I II M I M 1 1 r GAATACCTCCTTT 
50 TTTTTTTTTTTTACAGAGCrrTACA 

I I I I I I 1 1 1 1 1 1 I GTCCTTCGGGAG 
60 n M I I II I I I l AGGCGGCTTGCTG 
TTGCCCAGGGCACAG 
TTCTGTTGTTCTATG 
rrAAGGAAAAGGCTC 
TTATGAAGATGAGCA 
TTCACCCTCAGTGAC 
GTCAACTTATGCC 
TTGCAGGAAGAGGCT 
GTACAGACGC 
rrCGGTGTCCTTCTT 
TTGCAATGGGGAGCC 
TTTGGATCTGGATAA 
RGATGAAGATGAG 
GTTTGTACAGAC 
CGTTTGTACAGAC 
TTCTCAGGCCGCCAA 
TTCTCAGGCCACCAA 
TTATGTGGATCTGGA 
TTACACTCAGGCCGC 
TTCACACTCAGGCCG 
TTTCAGGCCACCAAC 
TTCGTCTGTACAAAC 
TTAGAACATCTGATC 
TTAGAACTGCTCATC 
TTTTGAATTTGATGA 
TTTTTTGAGTTTGATGA 
I I 1 i I 11 I I I M CAACCGGGAGGAG 
I I I 1 1 i I I I 1 I ICAACCTGGAGGAG 
I 1 n H IN 1 TTC ATCCTGGAGGAG 
1 I i I I I 1 1 1 1 1 I IGCTGGGGGGTCA 
I I I I 1 1 i i I 1 1 I GGCCTGACGAGGA 
AACTACGAGCTGG 
CCAGAGAATTAC 
IUIIIIIIIIHGCCGTAACTGGT 
TnTCCAGTACTCCTC 
TTTAGTGCCGGACAGG 
TTTACCCCCCAGCAGG 
TTTAGAGAATTACGTG 
CCAGTACTCCGC 
mGCATTCCTGCCGT 
mCGGGAGGAGCTCG 
mCAGCCAGAAGGAC 
TTTATTGCCGGACAGG 
CTGCAGCGCCGAG 
TTTGCGCGTACTCCTC 
mACAGAATTACCTT 



TTTTTAAGTGTACCAG 

TTTATCCTGGAGGAGA 

TTTGGTCATGGGCCCG 

TTTGGGAGGAGTACGC 

TTTTGGGGCGGCICTGA 
I 1 I I I I I I I 1 1 lAAAAGGTAATTCT 
TCTGCCGTAACTGG 
nTTTGTGTCTGCATA 
TTGGCTGTTCCAGTA 
GTCCCTGGTACAC 
CCTGCAGCGCCGA 
CTTGGAGGGGGA 
TTGAGGTCCTTCTGG 
TTCAACCGGCAGGAG 
TTTGTGTCTGCATAC 
TTCGGGAGC^QTTCG 
TTTGACCCTGCAGCG 
TTCAGAGAATTACCT 
TTTGGGTAGAAATCC 
TTTTACGTGCACCAG 
TTCGCTGCAGGGTCA 
TTAGCCAGAAGGACA 
TTGTTCCAGTAGTCC 
TTGGCCTGCTGCGGA 
TTTGCAGCGCCGAGG 
TTACTACGAGCTGGT 
TTCTGGGGCGGCCTG 
TTACAGCGACGTGGG 
TTTGCCGGACAGGAT 
TTCTGCCGTCCCTGG 
TTCATGGGCCCGACC 
TTGTCCCATTAAACG 
TTGTAACTGGTACAC 
TTAAGGACCTCCTGG 
TTCTCCTGGAGGAGA 
TTGAGAATTACGTGT 
TTCCTGATGAGGTGT 
TTCACAGGAGGAGCA 
TTTGCCGTCCCTGGT 
TTGGGAGGAGTTCGC 
TTTGGACAGGAGGAA 
TTACCCTGCAGCGTC 
TTCCGCCCGGAACTC 
TTGCTGCAGGGTCAC 
TTACAGGACTATCCA 
TTGCGTACTCCTGCC 
TTCCGTAACTGGTGC 
TTGCAGGAATGCTAC 
TTCCAGGCAGCATTC 
TTAACCGGGAGGAG 
TTTGGCCTC.AGGCGGA 
TTACTACGAGCTGGG 
TTATGAGGTGTACTG 
TTATACATCTACAAC 
AACTGGTACACT 
TTCACGTAATTCTCT 
TTAGCATTCCTGCCG 
TTACTGGTACACTTA 
TTGGCAATGCCCGCT 
TTGCTTCGTGCTGGG 
TTCGCCCGGAACTCT 
TTACAGGACTGTCCA 
TTTCCTCCAG6AGGT 
TTCCTTCTGGCTGTT 
TTGTTCCAGTACTCC 
TGTTCCAGTACACC 
TTTTTTCTCCTGTAGGAGA 
TTTACCTTTTCCAG 



rGGAGGAGTTCGTG 
II I I I IGAGGAGCTCGTGC 
7GCCGTAACTGGTG 
TGCCCGCTCCTCCT 
TCGTCCCTGGAAAA 



rCCGCTGCAGGGTC 
FAACCTGGAGGAGA 
mCCTGCCGTAAC 
TACGCTGCAGGGTC 
TCCACAGAATTACC 
TCCAGAGAATTACG 
TCGCCGAGTCCAGC 
rAACAGGCAGGAGT 
rrCCTCCAGGATGT 
TAACCGGCAGGAGT 
CTCCAGAGAATTA 



rGCCGTCCCTGGAA 
I II I II i ICCCCTCCAAGAAG 
TGCTGCCTGGGTAG 
I I 1 I III IICCAGTAGTCCTC 
TATTCCTGCCGTAA 
rCCTGGAAAAGGTA 
TCGTCCCTGGTACA 
CTCCTCCAGGAAG 
CTGATTCTGCGC 
TATCTCCCTGCTGG 
TGAAGGACAACCTG 
TCGTGCACCAGrrA 
TCGGACAGGGTATG 
TCGGACAGGATATG 
TGCACTCGGCGCTG 
rACACGTAATTCTC 
TCGTAACTGGTACA 
TAATGACCCCCCAG 



TTCTCTCCAGGAAG 

TCAGCGACGTGGGA 

TTCCTGCCGGTT6T 



TGAAGGACATCCTG 
IIIIIIGAAGGACCTCCTG 
nGTTCCAGTACAC 
rCAGAAGGACAACC 
fGCGTGATGAGGTG 
TTCACAAGAGGCAAC 

TTCATAAGAGGCAAC 

TTGAACAC.AGGCAAC 

TTACATCCTCATCTG 

rrGAGTGCCCATTGC 

TTCAGCCACAATGTC 

TTACAATCCCAGGGC 

TTACAACCCCAGGGC 

TTGTGGGCATTGTGG 

TTATGGGCATTGTGG 

TTCCAACACCCTCAT 

TTAGACTGTGGTCTG 
rCCAACATCCTCAT 
I I I I I I I I I I I I GGCCCACAGACAA 
I I I I I ICATGGGCATTGTG 
TTAACATCCTCATCT 






TTCAACACCCTCATT 
TTGACTGTGGTCTGC 
TTAGCACTGGGGACT 
TTCTTAGATHGACC 



I I 1 1 lAGATTTGACC 
TTCGATGTTCAAGTT 
UCAATCCCAGGGCG 
TTCCTCGGATGATGA 
CCACATAGAACT 
TTAAATTCATGGGTG 
TTCAGCCACAATGCC 



TTCACCATAAGAGGC 
TTTTCCTCCCTTCTG 
TTAACTCTCCTCAG 
TTTAAATCTCATCAG 
TTCTCCTCCCTTCTG 
TTGTCAGCCACAATG 
TTTCATTCCTTCTTC 
TTCTTCCTCCCTTCT 
ATAACTCTCCTCA 
TTGAGGCTCATCCAG 
TTC.AGGCTTGTCCAG 
TTATGTTGACCACAG 
TTAGTGCCCACCACA 
TTGAACATCCTGATT 
TTGGACCTGGAGAAG 
TTCCCTCTGGCCAGT 
TTCCCTCTGGGR-AGT 



TTTTACACCGTAAGA 
TTAGAAGATTTGACC 
TTGAACTGGCCAGAG 
TTGCTACAACTCTAC 
CAGTCTTACGGTC 



1 1 I I i I I I I I I ICAGTCTTATGGTC 
TTATCTTGCAGAGGA 
TTGGCTGGGGTGCTC 
TTGGGTCACCGCCCG 
TTCTGGGGCCGCCTG 
TTCTCGGCGCTAGGC 

•GTATCTGGTCACA 
TTAACTACGAGGTGG 
TTCCAGTACTCGGCG 
TTCGGTTATAGATGT 
TTGCAAGTCCTGGAG 
TTTQQACACAACGCC 

'CTGGGGCTGCCTG 
TTGGCCTTAAACTGG 
TTTGTGTCTGCATAC 

FGTCGGAAAGGGCT 
T6GGTGTATCGGGT 
TCCAGTAGTCGGCA 

rGTAGACATCTCCA 

TAGGAAACGGGCGG 
IIIIllllllCACACCCCGCACG 



TTTlilillliCCGCTCGGGTCC 



I I I I I 1 1 I I lAGCATCACCAGGA 

II IIUIIUCCAGTTTAAGGGC 



IlilllllllATAGCCACAAGGA 
I I 1 1 1 1 1 I I IGTATGCAGACACA 
CCAGTACTCGGC 



I I 1 1 1 1 1 1 I lAGCGCACGATCTC 
I I I H I I I I I GGACATCCTGGAG 
TTTTTTTTTTTGGGGCTGCCTGA 



I I II I I I 1 1 1 GTCAGAAAGGGCT 
I I I I I I I I I I CAGGAGCCCTTTC 
I I I n II 1 II TGTCTCTTCCTGG 
TTTTTTTTTTACACCCCGCACGC 
TTTTTTTTTTTGGTTTCGGAATG 
I 1 1 I I I I 1 1 I ) lAACGGGACAGAGC 
GCTGGGGCCGCCT 
GAGGATTTCGTGT 
TTTTTT II I I 1 1 GAGAGGAGTACGC 
TTCACATCAAAGTCC 
TTGCCAGGA6GAGAC 



TTGTACTCGGCGGCA 
II III I IICGCCAGTTGTCTC 
rTAGGGGGGTGGACA 
TTAGATGTATCTGGT 
TTTGGGGGAGTTCCG 
TTTGTCTCCTCCTGG 
TTCACACTCTGTCCA 
GGAATGATCAGGA 
TTATGGGGTCGCCGC 
TTTCAGATCAAAGTCC 
TTAACGGGACCGAGC 
TTAGGAGTACGTGCG 
TTATGTGACCAGATA 



TTAGGGGCGGCCTGT 
rrCGCCGGTTGTCTC 
TTTGTAACCAGACAC 
rTGTGAAGTAGCACA 
TTAGCGGCGACCCCA 
TTCACACCCTGTCCA 
TTGTGTGACCAGATA 
TTTGGACCrrCCAGA 
TTATCGGGTGGTGAC 
TTGTTTAAGGGCCTG 
TTTGAAGTAGCACAG 
TTGCTCCAACTGGTA 
TTCCTTAAACTGGTA 
TTAGGAGGACGTGCG 
TTTCGTGCTGGGGCT 
TTCGCTGCTGGGGCT 
CCAAGGAAGATCA 
TTACCGCGCCGTGAC 



TTGCCCTTAAACTGG 
TTTGGTCACACCCCG 
rrGGGAGTTCCGGGC 
TTAGGAGGAGACAAC 
TTGGGTGGACACAAC 
CTGCTCGGTGAC 
TTTGGGGCGGCTTGA 
TTGCGCACGTCCTCC 
M m I I I I I I I GCCTTAAACTGGA 
ll ll ll ll l lilG TACCTGGACAGA 
GTTCCTGGAGAGA 
TTTACACTCATACTTA 
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TTT 
TTT 



TTT 
TTT 



TTT 



TT 



mr 



TTTACACTCAGACTTA 
TTTTCCTGGAGCAGGC 
TTTTCGAAGGGCGCGT 
TTTAATCTGCACAGAG 
TTTAGGGCCCGCCTGT 
TTTAGGACACTCTGGA 
TTTGTGTAAACCTCTC 
'CTGTCGAA6CGCA 
TTTGGGGCCGGGCTGT 
TTTTCTTCCAGGATGT 
TTTAACTACGGAGTTG 
TTTCAAGAAACATGGT 
TTTTAACCAGGAGGAG 
GAAGCTCTCCAC 



TTTGGGGCGGCCTGTC 
TTTGCGGCGCGCGTGT 
CTTGGAGCTG 



TTTTTCTCTTCCTGGC 
TTTAACTACGGGGTTG 
TTTGTATCTGATCAGG 
TTTGGCCAGGTGGACA 
TTTGCCCCAGCTCCGT 
TTTGGTTCCTGGAGAG 
TTTGTCGAAGCGCACG 
GTGTCTGCAGTAG 
TTTGCTCCACTTGGCA 
TTTTACGGGGTTGGTG 
rrrCGGTTCCTGCACA 
TTTTCCAGTACTCGGC 



TTTGTCCACCTCGGC 

I I I I I I 1 1 1 I I I ICTTCCTGGCCGT 
nu ll GGTGTCCACCAGG 
ACTCCGTAGTTGT 
I II II II CACTCAGACTTAC 
I 1 1 1 1 1 1 GATGCTAGAAACA 
I I I I I II GTGGAATGGAGAG 
TTTTAACCAAGAGGAG 
TTTGTTCCGGAATGGC 



I I I 1 I 1 II i I I IGTATCTGCAGTAG 
TTTACCTCCTGGTCTG 



TTTAGCCAACAGGACT 
TTTGCGGTTGCTGCAG 
TTTCGCGCCGCGGTGG 
•GTAAACCTCTCCA 
•CTGATCAGGCTCC 
TTTTCCAGGACTCGGC 



I 1 1 1 II II 1 1 I lAACCATTCACAGA 
ll llll li mi CGGGCCCTGGTGG 
I I 1 1 I 1 1 1 M I I GTTCCGGAACGGC 
1 I 1 1 I I I I I I I IGCGGCCCGCCTGT 
TTTTCCTGGAAGACAC 
TTTGCCGGGTGGAGAA 
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CTGCTCCAGGATG 
TTTCAACTACTGCAGA 
TTTGTACCTGGAGAGA 
TTTACCTCTCCAGTCG 
GTGAAGCTCTCCA 
TTTCCGCGGCGCGCGT 

CTGATCAGGTTCC 
TTTAATGGGACGGAGC 
TTTTATGGAAGTATCT 
CTGCAGTAGGTG 
TTTCGGGCCGCGGTGG 
CTGTGCAGGAACC 
TTTCCAAGAGGAGGAC 
TTTCAATTACTGCAGA 
TTTCACCTACTGCAGA 
CTGCCTGGATAGA 
rrTGTAATTGTCCACC 
TTTCACCAGGGCCCGC 
TTTTGCGGTACCTGGA 



TTTCCTGCAGCACCAC 
TTTGCGGCGCGCCTGT 
CCAGGACTCGGCA 
TTTGACACAACTACGG 
TTTGATACAACTACGG 
TTTACTCAGACTTACA 
GAGACTTACACA 
TTTTACGGGGTTGTGG
TTTGTAGTTGTCCACC 
TTTAACCAGGAGGAGT 
TTTAACGAAGAGGAGT 
TTTTCCACAGCCCCGT 
TTTCAGCCAGAAGGAC 
TTTGGAGGAGTTCCTG 



TTTGAACTCCTCCTGG 
TTTAACCACTCACAGA 
TTTGGCCGGGCTGTTC 
TTTCTCACGAGTCCTG 
TTTGTCGAAGCGCAAG 
TTTCCTCCTGGTCTGT 
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TTTCAGTCTGTGAGT 

TTCCGCAGGCTCTCT 

TTATGAGGTATTTCT

TTGGACATGGAGGTG 

TTCAGGTAGGCTCTC

TTTACTCTTGGGGGC 

TTGGTCGCCAGGTCC 

TTGGGAGCCCGCCCA 

TTCCGCTGCTCCGCC 

TTTGAAGGCCCAGTC

TTGCAGCCATACATC 

TTCCACTCCACGCAC 

rrCACGTCGCAGCCA 

TTGGTCTGCCCGAGC 

TTCAGGTAGACTCTC 

TTGGGAGACACGGAA 

TTCCCGTCCACGCAC 

rrGTCCACTCGGTCA 
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TTATCCAGAGGATGT 

TTCGCGATCCGCAGG 

TTCCGGGACAGGGAA 

TTGGAGGAGGAACAG 

TTAAGTGAAGGCCCA 

TTGGGGCTTGGGGAG 

TTCAGACTAACCGAG 

TTGTCCTGGGGGGGT 

TTCGTCGTAAGCGTC 



TTAGGTCCACTCGGT 
TTGGTAGGCTCTCAA 
TTGCGCGATCCGCAG 
TTGTGTCCTGGGTCT 
TTATCCAGATAATGT 
TTCCGTCGTAGGCGT 
CATATTCCGTGT 
TTCGGACCCCCCCCA 
TTGCCGCATGGACCG 
TTGCTGCTCCGCCGC 
TTAGCGCAGGTCCTC 
TTCTACCTGGATGGC 
TTGGTATTTCTTCAC 



TTATATGAAGGCCCA 
I U 1 I i I I I I I I CCGTGTCTCCCCG 
TTCCGGCAGTGGAGA 
TTCGGACGCCCCCAA 
CCGTGAGGCGGAG 
TTAGGAGACAGGGAA 
TTAGAGCGAGGACGG 
TTGCACATGGCAGGT 
TTCAGCTGCTCCGCC 
TTATGAACAGCACGC 
TTCCCGGCCCGGCAG 
TTGCAGCCTGAGAGT 
TTGACGGTCATGGC 



TTCCGTCGTAAGCGT 
TTGAGTATTGGGACC 
CTGGCCTGGTTCT 
TTTACCTCATGGAGTG 
TTAGCCGCCATGTCC 
TTCACGTGCCATCCA 
TTGGTCCCCAGGTTC 
TTAGGAGAAGAGATA 
CTGCTGCTCCGCC 
TTTGACCCAGACCAG 
TTCGGGCGGAGCAGT 
TTAGGTTCGCTCGGT 



TTCATATGCGTCCTG 

TTCGTCCTGGGGGGG 

TTGCACGTGCGTGGA 

TTGGTATTTCTACAC 

TTAGGAGCAGAGATA 

TTCCCGAACCCTCGT 

TTGCCACATGGGCCG 

TTAGCAGGAGGAGCC 



TTATCCAGATGATGT 

TTGGATGGGGAGCAC 

TTGCACTGGCGCTTC

TTAGCTTGTAAAGTG 

TTGATAATGTATGGC 
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I i I I I I t 1 1 1 1 1 ICACACCCTCCAG 
CTACGTGGACAAC 
TTCGAGCGAACCTGG 
TTCGAGACAGCCTGC 
GGGCTACGTGGAC 
rrACCACCAGTACGC 
GAGGATGTATGGC 
TTGATCTCAGCCGCC 
TTGATCTGAGCTGCC 
TTGATGATGTATGGC 
TTATACCTGGAGAAC 
TTGATGTATGGCTGC 
CCGCAGGTTCrC 
TTGAGCAGAGATAAA 
TTGGGCTGGGAAGAC 
TTGATGGGCAGGACT 
"CACTTTCCCTGT 
TTCCCACGATGTGGA 
TTAGTCATATGCGTT 
TTGGCGGACATGGCG 
TTGCTCCGCCTCACG 
TTCGTCGTAAGCGTT 
GATCATGTTTGGC 



TTCACGGACGCCCCC 
TTGCTCCTCCTGCTC 
TTACTCACCGAGTGG 
TTAGTCATATGTGTC 
TTGGTCTGAGCTGCC 
CCCACTTGCGCT 



TTGCCGACTGACAGA 
TTGGCTCACATCACC
I i I I I I 1 I I I I IGCTCTTGGACCGC 
1 i 1 1 1 1 1 i f 1 1 IGAGAGCCTGCGGA 
I I 1 1 i I I I I I I IGGAACACACGGAA 
TTTTTTI I I I I ICGGAACACACGGA 
I 1 1 I I I 1 1 1 mCGTAAGCGTCCTG 
GCCGGTGCGTGGA 
TTGCCGCATGGGCCG 
7TCCAGAGCGAGGAC 
TTCCCAACGGGCCGC 
TTCGAGTGCGTGGAG 
TTGCGAACCTGGGGA 
TTCGGGTACCAGCGG 
TTTGAAGCGGGGCTC 
TTGGCGGCCCGTTGG 
■CTGGGTCAGGGC 
TTGCCTCATGGGCCG 
TTCCATCCCGCTGCC 
TTAGCTCAGACCACC 
TTGTCGTAAGCGTCC 



TTCCCGGCCGCGGGA 
TTGGTCCCAATACTC 
TTCGTCCCAATACTC 
TTGTTCTCACACCAT 
TTTCCTCTGGATGGT 
TTTCCCACTTGTGCT 
'CCTGACCCAGACC 
lAGAGCCCGCCC 
TTGAGTGCGTGGAGT 
TTTACATCATCTGGA 
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GATCCGCAGGTTC 
I I I I I I I I I I I I lA GAGCAGGAGAG 

I I I I I I I I I I I I CCTGGCAGCGGGA 
GATGGAGTGAGA 

CCGGCCGCGGGAA 
H I I 1 1 I m I I CCAGGACACGGAG 

I I n 1 1 1 1 1 U I CCGGGACACGGAG 
1 1 I t I I I 1 1 1 I IGCAGCCACACATC 
1 1 I I I I I I III I GGATGGTGTGAGA 

1 0 li m il l l lll AACATGATCTGGA 
TTTTTTTTTTTTTCCTCCTCCACAT 
IIIIIIIIIIIII GGGCGGAGCAGT 
1 1I I I II II II 1 1 GCAGGGGATGGA 

I I I I I I 1 1 I I n CGCAGGAAGCGCC 
15 1 1 I I I I I 1 1 I n GGCCGTCATGGCG 

I I I I I I I 1 1 I I l ATGCGTCCTGGGG 

II I I I I I I I II l A TGCGTCTTGGGG 
20 1 n I I i 1 1 1 U I I 1 1 CCCTGTCTCC 

1 1 1 1 I m i III I CAGGGTGGCCTC 
20 I I I I II III H I GAGGAGGAACAGC 
I 1 I I I H 1 1 I 1 1 GCGCAGGGTCGCC 
1 1 1 I I II 1 1 1 1 I CAGCCAAACATCC 
I II I H n I I I l ACTTCTGGAAGGT 
TTTTTTTTTTTTTCCTCTGGACGGT 
25 25 1 1 1 I I 1 1 I 1 II IGGAGAAGAGATAC 

I I 1 1 I 1 1 1 1 1 1 lATTCCGTGTCTCC 
TTTTTTTTTTTTTCAATCTGTGAGT 
I I 1 1 I I II I I I IGGCCCGTCGGGCG 
1 1 1 1 I 1 1 1 1 I I I CGGCGGACATGGC 
30 n I I I I I I I I I I lA GAAGCTGTGAG 
I I I I I I I I I I I ICGAACTGCGTGTC 
I 1 1 1 I 1 1 1 1 I I I CGAGCTCCGTGTC 
I I I I I I I I I I I lACTCCACGCACCG 
I I I I I I I I I I I ICTACGTGGACGAC 






HLA-G 



35 TTTTTTTTTTTTTGAGCTGGGAGCC 

] n I 1 1 1 1 I I I I ATCACAACAGCCA 
40 1 1 1 I 1 1 II II I lAGGCTCTCCGCTC 
I 1 1 1 I II 1 1 1 1 IGGAGTGGGAGCAG 
TTTTTTTTTTTTTCACACCCTCCAG 
I n I H I 1 1 I I lA CTCCACGCACAG 
I n I 1 1 I 1 1 1 I I GCCGTCGTAGGCG 

40 45 III 1 li mn ICGCGCAQAACCCC 

mi li um i a gtagccgcgcag 
1 i iii i i i 111 iggagcggacagcc 
TTTTTTTTTTTTCAGGTAGGCTCTC
t i m i i i 1 11 iggttcggggctcc 

50 1 1 I I 111 III I IGCCCCAAGCCCTC 
11 1 I III I I I I IGGGCATGACCAGT 
^5 I m I II I I II I GCGGCTCCGCGGC 

I III III I I 111 ICCAGTGGATGTA 

I I m 1 1 1 1 11 iggcatgaccagtt 

55 1 1 1 III 1 1 1 I I I CTCACTCGGTCAG 

I il l 1 1 I I I 1 1 i caagccctcctcc 

III M I ill I I I IAGTTTCCGC.AGG 
IIII II II I II ICAGGTCGCAGCCA 
HI I i I III I I ICACTGCGATGAAG 

60 I I m I III 1 I IGGTATGACCAGTT 
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60 



rTACAGCCAGGCCAG 

TTGAGGCGGAGCAGC 

TTTGGTTGTAGTAGC 

TTACCTGCGGAAACT 

TTCGGCCCAGGTCTC 

TTGCTGGACGCAGCC 

TTCAGGTTCCGCAGG 

TTCCGCCAGGCACAG 

TTCCTCCTACACATC 

TTACGGCGGAGCAGC 

TTAGCGCGCGGAACC 

TTTTCACTCGGTCAG 

TTACGCCGCGAGTCC 

TTTGGAGCAGGAGGG

TTGGGTATGACCAGT 



TTATACCTGGAGAAC 
TTGGGTTCGGGGCTC 
TTGAGGGCTAGGAGA 
TTATCTGAGCCGCTG 



TTCGCGGAGAGCCCC 
I I I I I I I ICCTGGCGCTTGTA 
I I I I I 1 I ICCTGCG6AAACTA 
I I I I I I lAGCGTCTCCTTCC 
IIIMIillGGCGGCCCGAAC 
TTATGATGTGAGACC 
CTCGGTGTCCTGG 
TTGTAGTAGCCGCGT 
TTAGGATGTGAGACC 
TTGGTAGGCTCTGTG 
AGCGTCTTCTTCC 
TTCATAGGAGGAAGA 



TTGACAACCAGGACA 
TTGCCGCGGGGAGCC 
'GGTGAGGGGCTCT 
TTCGAGGGGCTGCCA 
TTGGGTATAACCAGT 
TTTCCAGAATATGTA 
TTGGGTGCAGGGCTC 
TTCGCGCGGAACCCC 
TTTAGTAGCCGCGTA 
TTAGCTGCTCTCAGG 
TTACCGCACGAACTG 
CCGCAGGCTCACT 
TTGGTGTGAGACCCG 
GGAGCCCCGAAC 
TTAGCCGCGGGAGCC 
TTACTGCACGAACTG 
TTCCGCACGAACTGT 
TTGGTGCAGGGCTCC 



TTTTTTTGCAGCAGGAGCAG 
GAGTCTCTCATC 
CCGCCGTGTCCGC 
'CCACGCACAGGC 
I I I 1 1 I I I I I I lACTCGGTCAGCCT 
1 1 1 I I I 1 1 I 1 1 ICACACC^TCCAGA 
I I I I 1 n H M ICACACCCTCCAGA 
I I I I I i 1 1 I 11 IGCAGCAGGATGAG 
CAGCCACCACAGC 

CGTGGGTGGCCT 

I I I I I I I I I I I I lACGGCGGAGCAG 
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TTTCTCACACCATCCA 
TTTTGCGGCGGAGCAG
TTTTCTGAGCCGCCGT 
TTTGGCGGAGCAGCAG 
TTTCCGCTGCGGACAC 
rrTTATAACCAGTTCG 
TTTCACATCCTCCAGA 
T7TCCGTGTCCGCGGC 
CGTGGACGACACA 
CCGCTGTGTCCGC 



TT 
TT 

TTTGAAGAATGGGAAG 
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CLAIMS 

1. A method of identifying a set of extendible primers for use in
the identification, typing or classification of a nucleic acid of known
sequence having known polymorphisms wherein:
i) all possible nucleotide sequences of a chosen length of the
nucleic acid are identified and their corresponding extendible primers,
ii) at least one extendible primer is removed from the set
wherein the at least one primer removed identifies a segment of the nucleic
acid identified by at least one other primer.

2. The method of claim 1, wherein between steps i) and ii):
ia) potential extensions for each primer are identified with
respect to each nucleotide sequence,
ib) for each extendible primer the identified potential extensions
are compared to determine which pairs of sequences can be discriminated
by the primer.
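
As an informal illustration only, and not as part of the claims, steps i), ia) and ib) might be sketched in Python as follows. The sketch assumes that each known sequence is supplied as a plain string, that a primer is represented by the length-k window it anneals to (strand complementarity is ignored for simplicity), and that its potential extension is the single base following that window. The function names and the toy sequences are hypothetical.

```python
from itertools import combinations


def candidate_primers(alleles, k):
    """Step i): every length-k window that is followed by at least one base."""
    return sorted({seq[i:i + k]
                   for seq in alleles.values()
                   for i in range(len(seq) - k)})


def extensions(primer, seq):
    """Step ia): the set of bases that could extend the primer on this sequence.

    An empty set means the primer finds no site and is not extended.
    """
    k = len(primer)
    return frozenset(seq[i + k]
                     for i in range(len(seq) - k)
                     if seq[i:i + k] == primer)


def discriminated_pairs(primer, alleles):
    """Step ib): pairs of sequences whose potential extensions differ."""
    ext = {name: extensions(primer, seq) for name, seq in alleles.items()}
    return {(a, b)
            for a, b in combinations(sorted(alleles), 2)
            if ext[a] != ext[b]}


# Toy data: two hypothetical alleles differing at a single position.
alleles = {"A1": "ACGTAC", "A2": "ACGTTC"}
for primer in candidate_primers(alleles, 3):
    print(primer, sorted(discriminated_pairs(primer, alleles)))
```

In this toy case the window immediately upstream of the polymorphic position separates the two alleles, whereas a window whose extension is the same base in both does not.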

3. The method of claim 1 or claim 2, wherein a matrix of primers
and pairs of primer extensions is prepared in binary form and is subjected
to analysis by a set covering problem (SCP) algorithm.

4. The method of claim 3, wherein a greedy algorithm is used.
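
By way of a hedged illustration, not forming part of the claims, a textbook greedy heuristic for the set covering problem is sketched below in Python. Each row of the binary matrix is assumed to be stored as the set of sequence pairs that a primer discriminates; the name `greedy_cover` and the example data are hypothetical, and the sketch is not asserted to be the exact procedure used by the applicant.

```python
def greedy_cover(coverage):
    """Pick primers until every discriminable pair is covered.

    coverage maps each primer to the set of pairs it discriminates,
    i.e. the positions holding a 1 in that primer's row of the matrix.
    """
    uncovered = set().union(*coverage.values())
    chosen = []
    while uncovered:
        # Take the primer covering the most still-uncovered pairs;
        # iterating in sorted order makes tie-breaking deterministic.
        best = max(sorted(coverage),
                   key=lambda p: len(coverage[p] & uncovered))
        chosen.append(best)
        uncovered -= coverage[best]
    return chosen


# Hypothetical example: four primers, four pairs labelled 1 to 4.
example = {"p1": {1, 2, 3}, "p2": {3, 4}, "p3": {4}, "p4": {1}}
print(greedy_cover(example))
```

The greedy choice gives no guarantee of a minimum set, which is one reason a further heuristic or a subsequent check on the result can be useful.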

5. The method of claim 3, wherein a CFT algorithm is used
which involves a Lagrangian relaxation heuristic.

6. The method of any one of claims 3 to 5, wherein a set of core
primers is selected as a base for analysis by the SCP algorithm.



WO 00/65088 PCT/EP00/03636



7. The method of any one of claims 3 to 6, wherein the set of
extendible primers identified by the SCP algorithm is subjected to a
redundancy check.
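
Purely as an illustrative sketch outside the claims, one simple form of redundancy check is shown below in Python: a selected primer is dropped when every pair it discriminates is also discriminated by the primers that remain. The function name `prune_redundant`, the processing order and the example data are assumptions made for illustration.

```python
def prune_redundant(chosen, coverage):
    """Remove primers whose pairs are all covered by the other kept primers.

    chosen is the list produced by a set-cover heuristic; coverage maps
    each primer to the set of sequence pairs it discriminates.
    """
    kept = list(chosen)
    for primer in list(chosen):
        others = set().union(*(coverage[p] for p in kept if p != primer))
        if coverage[primer] <= others:
            kept.remove(primer)
    return kept


# Hypothetical example: any two of the three primers already suffice.
example = {"p1": {1, 2}, "p2": {2, 3}, "p3": {1, 3}}
print(prune_redundant(["p1", "p2", "p3"], example))
```

Because primers are examined in the order given, a different order may yield a different, equally valid reduced set.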



8. A set of extendible primers, for use in the identification, typing
or classification of a nucleic acid of known sequences having known
polymorphisms, identified by the method of any one of claims 1 to 7.

9. The set of extendible primers of claim 8, in the form of an
array.



10. The set of extendible primers of claim 8 or claim 9, for use in 
the identification, classification or typing of an organism, allele or gene 
selected from class 1 HLA, class 2 HLA and 16S rRNA. 

11. The set of extendible primers of any one of claims 8 to 10, 
wherein the primers are arrayed on a surface of a support in such a way 
that recognisable patterns are formed with different types or alleles. 



12. A set of extendible primers, for use in the identification, typing
or classification of a human leucocyte antigen (HLA) gene as indicated, the
set comprising about the number of primers indicated and being capable of
distinguishing about the number of alleles indicated:

          HLA gene   Number of Alleles   Number of Primers
Class I   HLA-A      91                  172
          HLA-B      200                 <1000
          HLA-C      47                  94
Class II  DPA-1      11                  26
          DPB-1      74                  130
          DQA-1      17                  130
          DQB-1      34                  84
          DRB-1      192                 <1000
          DRB345     35                  94

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13. A set of extendible primers, for use in the identification, typing 

or classification of 16S rRNA, wherein the set comprises about 210 primers
and is capable of distinguishing at least about 1207 different sequences.

14. The set of extendible primers of claim 12 or claim 13, wherein
the primers have variable segments substantially as set out in appendix 1
or appendix 2.

15. A method of identification, typing or classification of a nucleic
acid of known sequence having known polymorphisms, by the use of the
set of extendible primers as claimed in any one of claims 8 to 14, which
method comprises applying the nucleic acid or fragments thereof to the set
of extendible primers under hybridisation conditions, and effecting
template-directed chain extension of extendible primers that have formed
hybrids.

16. The method of claim 15, wherein the set of extendible primers
is provided in the form of an array, and template-directed chain extension is
effected using labelled chain-terminating nucleotide analogues.

17. The method of claim 16, wherein template-directed chain
extension is effected using four different fluorescently-labelled chain
terminating nucleotide analogues, and the results are analysed by total
internal reflection fluorescence or confocal microscopy.

18. The method of any one of claims 15 to 17, wherein the
nucleic acid is a PCR amplimer.

19. The method of any one of claims 15 to 18, wherein the
nucleic acid is HLA Class 1 or HLA Class 2 or 16S rRNA or a PCR
amplimer thereof.

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20. The method of any one of claims 15 to 19. wherein a 

dUTP/uracil-DNA-glycosylase system is used to break the nucleic acid into
fragments.

21. A kit for use in the identification, typing or characterisation of
a nucleic acid of known sequence having known polymorphisms,
comprising the set of extendible primers as claimed in any one of claims 8
to 14.



22. The kit of claim 21, comprising also a pair of primers for
effecting PCR amplification of the nucleic acid.

23. An array of sets of extendible primers as claimed in any one
of claims 8 to 14, for the simultaneous identification, typing or classification
of two or more different HLA genes.



24. A computer readable storage medium having a program
recorded thereon, wherein the program consists of instructional steps for
identifying a set of extendible primers for use in the identification, typing or
classification of a nucleic acid of known sequence having known
polymorphisms, the steps comprising:
i) identifying all possible nucleotide sequences of a chosen
length of the nucleic acid and their corresponding extendible primers,
ii) removing at least one extendible primer from the set wherein
the at least one primer removed identifies a segment of the nucleic acid
identified by at least one other primer.

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25. Computer readable program implement consisting of 

instructional steps for identifying a set of extendible primers for use in the
identification, typing or classification of a nucleic acid of known sequence
having known polymorphisms, the steps comprising:
i) identifying all possible nucleotide sequences of a chosen
length of the nucleic acid and their corresponding extendible primers,
ii) removing at least one extendible primer from the set wherein
the at least one primer removed identifies a segment of the nucleic acid
identified by at least one other primer.



